Abstract

We address the problem of universal communications over an unknown channel with an instantaneous
noiseless feedback, and show how rates corresponding to the empirical behavior of the channel
can be attained, although no rate can be guaranteed in advance. First, we consider a discrete
modulo-additive channel with alphabet $\mathcal{X}$, where the noise sequence $Z^n$ is arbitrary and
unknown and may causally depend on the transmitted and received sequences and on the encoder's
message, possibly in an adversarial fashion. Although the classical capacity of this channel is zero,
we show that rates approaching the empirical capacity $\log|\mathcal{X}| - H_{\mathrm{emp}}(Z^n)$ can be
universally attained, where $H_{\mathrm{emp}}(Z^n)$ is the empirical entropy of $Z^n$. For the more general
setting where the channel can map its input to an output in an arbitrary unknown fashion subject only
to causality, we model the empirical channel actions as the modulo-addition of a realized noise
sequence, and show that the same result applies if common randomness is available. The results are
proved constructively, by providing a simple sequential transmission scheme approaching the empirical
capacity. In part II of this work we demonstrate how even higher rates can be attained by using more
elaborate models for channel actions, and by utilizing possible empirical dependencies in its behavior.

Index Terms - Feedback Communications, Universal Communications, Arbitrarily Varying Channels, 
Adversarial Channels, Individual Sequences 

I Introduction 

The capacity of a channel is classically defined as the supremum of all rates for which communication with
arbitrarily low probability of error can be guaranteed in advance. However, when a noiseless feedback
link between the receiver and the transmitter exists, one does not necessarily have to commit to a rate
prior to transmission, and communication can take place using some sequential scheme at a variable rate
determined by the specific realization of the channel: the better the channel realization, the higher the
rate of transmission. When the channel law is known, this approach cannot yield average rates exceeding
those attainable by fixed-rate feedback schemes, and for large classes of channels cannot even exceed the
rates of non-feedback schemes. The variable-rate approach may, however, have the advantages of
a better error exponent and a lower complexity. Several transmission schemes possessing such merits were
proposed for the binary symmetric channel (BSC) [4] [5], the Gaussian additive noise channel [6] [7], discrete
memoryless channels (DMC) [8] [7] and finite-state channels (FSC) [9].

When the channel law is unknown to some degree, variable-rate feedback schemes become even more
attractive, as the realized channel may sometimes be explicitly or implicitly estimated via feedback. In
[10], a rate-universal scheme for an unknown DMC with a random decision time was suggested (later termed
rateless coding), attaining a rate equal to the mutual information of the channel in use for any selected
input distribution. Following this lead, universal variable-rate transmission schemes for compound BSC
and Z-channels with feedback were introduced [12], and shown to attain any fraction of the realized channel's
capacity and achieve the Burnashev error exponent [13]. In [14] it was shown that for compound FSC with
feedback, it is possible to transmit at a rate approaching the mutual information of the realized channel for
any Markov input distribution, by an incremental universal compression of the errors, e.g., via Lempel-Ziv
coding.

So far, however, the variable-rate approach has not been applied to more stringent channel uncertainty
models, where the channel behavior is arbitrary or even adversarial. As a motivating example, consider
a binary modulo-additive channel with feedback, where the noise sequence is an individual sequence (i.e.,
deterministic and unknown). Let us assume (at first) that the fraction of '1's in the noise sequence (namely
the fraction of errors inserted by the channel) is a priori known to be at most $p \in [0, 1]$. The fixed-rate
communication problem in this setting has been addressed before in several different contexts. In
a classical work [15], Berlekamp considered this model in the context of error correction capability with
feedback, where the receiver is required to correct the errors inserted by the channel and uniquely recover
the transmitted message. Since no decoding errors are allowed, the noise sequence in this case can also
be thought of as being generated by an adversary that knows the message and the coding scheme, but is
"power limited" by $p$. Berlekamp showed that whenever $p > 1/3$, there exists an adversarial strategy for
error insertion such that the receiver cannot hope to separate even three messages, and so the capacity is
zero. For smaller $p$, he was able to show that the capacity is upper bounded by $1 - h_B(p)$¹ and a (tight)
straight line tangent to $1 - h_B(p)$, intersecting the horizontal axis at $p = 1/3$. The convex part of this bound
was later shown to be tight as well [16].

The same communication problem can also be studied in the context of the (discrete memoryless)
Arbitrarily Varying Channel (AVC). In an AVC setting, a memoryless channel law is selected from a given
set (state space) at each time point, in an arbitrary unknown manner. The AVC without feedback was
studied extensively [17] [18] [19], and shown to yield different capacities depending on the error criterion
(average/maximum error probability) and also on the existence of common randomness (resulting in the
so-called random-code capacity, which is the same under both error criteria). Within the AVC framework,
the channel under discussion is a binary AVC with two states, a clean channel and an inverting channel,
where the noise sequence becomes the state sequence and the maximal fraction of channel errors $p$ yields
a state constraint. Under the maximum error probability criterion, this AVC with feedback is equivalent
to Berlekamp's setting, the capacity of which was given above. Interestingly, it turns out that even
without feedback, the random-code capacity of this channel is given by $1 - h_B(p)$ for any $p < 1/2$ (and zero
otherwise)², and can be attained with merely $O(\log n)$ bits of common randomness [19] [20]. This small
amount of randomness can be generated via feedback with a negligible impact on rate, hence the capacity
of the discussed binary channel with feedback coincides with its (non-feedback) random-code capacity,
yielding a significant gain relative to Berlekamp's capacity through the use of randomness. This approach
of extending non-feedback AVC results to the feedback regime was also taken in [21] (albeit without state
constraints), where it was shown that the feedback capacity of an AVC is equal to its random-code capacity.

¹ $h_B(\cdot)$ is the binary entropy function.

² In fact, this is also the deterministic coding capacity without feedback, under the average error probability criterion.

Berlekamp's result and the AVC approach are limited by the requirement to commit to a fixed rate
prior to transmission, and so positive rates are obtained only under noise/state sequence constraints. As
we shall see, variable-rate coding can be used to obtain a much stronger result that applies to any noise
sequence without any constraint (i.e., $p = 1$). As a corollary of our main result, we constructively show that
for the binary channel under discussion, rates arbitrarily close to $1 - h_B(p_{\mathrm{emp}})$ can be achieved by a simple
(deterministic, algorithmic) sequential feedback scheme with probability approaching one and a vanishing
(maximum) error probability³, where $p_{\mathrm{emp}}$ is the empirical fraction of '1's in the individual noise sequence.
Thus, although the fixed-rate capacity is zero when there are no constraints on the noise sequence, one
can opportunistically attain rates approaching what would have been the capacity of the channel had the
fraction of '1's in the noise sequence been known in advance. It is therefore only appropriate to call the
quantity $1 - h_B(p_{\mathrm{emp}})$ the (zero-order, modulo-additive) empirical capacity of the realized channel.
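
To make the quantity $1 - h_B(p_{\mathrm{emp}})$ concrete, the following minimal Python sketch evaluates it for a given binary noise realization. The function names and the example sequence are illustrative assumptions, not part of the original text; the sketch only computes the target rate and says nothing about how a scheme attains it.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy h_B(p) in bits, with h_B(0) = h_B(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def binary_empirical_capacity(noise: list) -> float:
    """Return 1 - h_B(p_emp), where p_emp is the fraction of 1s in `noise`."""
    if not noise:
        raise ValueError("noise sequence must be non-empty")
    p_emp = sum(noise) / len(noise)
    return 1.0 - binary_entropy(p_emp)


# Hypothetical noise realization: one error in eight channel uses.
example_noise = [0, 0, 0, 1, 0, 0, 0, 0]
example_capacity = binary_empirical_capacity(example_noise)
print(f"p_emp = {sum(example_noise) / len(example_noise):.3f}, "
      f"empirical capacity = {example_capacity:.4f} bits/use")
```

A noise sequence with no errors (or with all errors) gives an empirical capacity of one bit per channel use, while a sequence with exactly half '1's gives zero.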

More generally, in this paper we consider a discrete channel with feedback over a common input/output
alphabet $\mathcal{X}$, that maps its input to an output in a modulo-additive fashion, where the corresponding noise
sequence is arbitrary and unknown and may causally depend on the transmitted and received sequences
and on the encoder's message. We constructively show that rates arbitrarily close to the (zero-order,
modulo-additive) empirical capacity $\log|\mathcal{X}| - H_{\mathrm{emp}}(Z^n)$ can be achieved by a simple sequential scheme with
probability approaching one, where $H_{\mathrm{emp}}(Z^n)$ is the (zero-order) empirical entropy of the noise sequence
$Z^n$. Furthermore, we consider the more general setting where the channel can map its input to an output
in an arbitrary unknown fashion (not necessarily modulo-additive), subject only to causality. By modelling
the channel actions as the modulo-addition of a realized noise sequence, we show that the corresponding
empirical capacity can be achieved, if common randomness is allowed. These channel models can also
be interpreted as adversarial, where an adversary (jammer) that knows the transmission scheme and is
in possession of the transmitted message causally listens to the transmitted and received sequences and
employs an arbitrary unknown jamming strategy.

The paper is organized as follows. In Section II some notations and useful Lemmas are given. The
channel model and the main result of the paper are provided in Section III. A finite-horizon feedback
transmission scheme achieving the empirical capacity in a modulo-additive setting is described in Section
IV, and its analysis appears in Section V. A short discussion is provided in Section VI. The horizon-free
variant of the scheme appears in Appendix B, and its extension to general causal channels under the
modulo-additive model, using common randomness, is discussed in Appendix C. Part II of this work [22]
is dedicated to the investigation of more elaborate models for channel actions, and the corresponding gain
in rate that may be achieved by utilizing empirical dependencies in the channel's behavior.



II Notations and Preliminaries 



The following standard asymptotic notations are used: 
For $n \in \mathbb{N}$ we use the convention $(n) = \{0, 1, \ldots, n-1\}$. All logarithms are taken to base 2. For
vectors, we write $z_m^n = (z_m, z_{m+1}, \ldots, z_n)$, which by convention is the null string if $m > n$, and use
$z^n = z_1^n$ for short. For real-valued vectors, $\|\cdot\|_\infty$ is the $\ell_\infty$ norm. Random variables (r.v.'s) are usually
denoted by uppercase letters, with the corresponding lowercase letters for realizations. We write $H(\cdot)$ for the
entropy function, $h_B(\cdot)$ for the binary entropy function, and $D(\cdot\|\cdot)$ for relative entropy. A finite alphabet
$\mathcal{X}$ in this paper is taken to be the set $\mathcal{X} = (|\mathcal{X}|)$ associated with the modulo-addition operator $+$, unless
otherwise stated.

³ The probabilities in this case are taken w.r.t. randomness created via the feedback link.

Lemma 1 (Entropy $\ell_\infty$ bound). Let $p$ be a probability distribution over a finite alphabet $\mathcal{X}$. Then

$$H(p) \geq \log|\mathcal{X}| \left(1 - |\mathcal{X}|\,\|p - p_u\|_\infty\right)$$

where $p_u$ is the uniform distribution over $\mathcal{X}$.

Proof. See Appendix A. □

For a sequence $z^n \in \mathcal{X}^n$, the number of occurrences of the symbol $i \in \mathcal{X}$ is denoted by $n_i(z^n)$. The
empirical distribution of $z^n$ is the vector of relative symbol occurrences in $z^n$,

$$P_{\mathrm{emp}}(z^n) \triangleq \frac{1}{n}\left(n_0(z^n), n_1(z^n), \ldots, n_{|\mathcal{X}|-1}(z^n)\right)$$

where by convention, the empirical distribution of a null string is taken to be uniform. When $z^n$ is a binary
sequence, we write $p_{\mathrm{emp}}(z^n)$ for its empirical fraction of '1's, and loosely refer to $p_{\mathrm{emp}}(z^n)$ as the empirical
distribution of $z^n$. The (zero-order) empirical entropy of $z^n$ is $H(P_{\mathrm{emp}}(z^n))$, the entropy pertaining to the
empirical distribution, and is denoted by $H_{\mathrm{emp}}(z^n)$ for short. For a binary sequence, the empirical entropy
is written $h_B(p_{\mathrm{emp}}(z^n))$.
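
The definitions above translate directly into a few lines of Python. This is a minimal sketch under the paper's conventions (alphabet $\{0, \ldots, |\mathcal{X}|-1\}$, base-2 logarithms, uniform distribution for the null string); the function names are illustrative assumptions.

```python
import math
from collections import Counter


def empirical_distribution(z, alphabet_size):
    """Vector of relative symbol occurrences; uniform for the null string."""
    n = len(z)
    if n == 0:
        return [1.0 / alphabet_size] * alphabet_size
    counts = Counter(z)
    return [counts.get(i, 0) / n for i in range(alphabet_size)]


def entropy(p):
    """Entropy in bits of a probability vector, with 0 log 0 = 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def empirical_entropy(z, alphabet_size):
    """Zero-order empirical entropy H_emp(z^n) = H(P_emp(z^n))."""
    return entropy(empirical_distribution(z, alphabet_size))


# Hypothetical ternary sequence with symbol counts (1, 2, 1).
example_z = [0, 1, 1, 2]
example_p = empirical_distribution(example_z, 3)
example_h = empirical_entropy(example_z, 3)
print(example_p, example_h)
```

The corresponding empirical capacity is then `math.log2(alphabet_size) - empirical_entropy(z, alphabet_size)`.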

A sequential probability estimator over a finite alphabet $\mathcal{X}$ is a sequence of nonnegative functions
$\{p_k(\cdot|z^{k-1})\}_{k=1}^\infty$ which sum to unity for any $k \in \mathbb{N}$, $z^{k-1} \in \mathcal{X}^{k-1}$. As usual, the function $p_k(\cdot|z^{k-1})$ is
thought of as a probability assignment for the next symbol given past observations $z^{k-1}$. The probability
assigned by the sequential estimator to any finite individual sequence $z^n$ is therefore defined as

$$p(z^n) \triangleq \prod_{k=1}^{n} p_k(z_k|z^{k-1}), \qquad z^n \in \mathcal{X}^n$$

We would also be interested in the following quantity,

$$p(z^n\|w^n) \triangleq \prod_{k=1}^{n} p_k(z_k|w^{k-1}), \qquad z^n, w^n \in \mathcal{X}^n$$

which is the probability assigned to the individual sequence $z^n$ by a sequential estimator matched to a
different individual sequence $w^n$, namely the case of sequential estimation from noisy observations.
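A well known probability estimator is the Krichevsky-Trofimov (KT) estimator [23], given in its standard form by

$$p_k^{\mathrm{KT}}(i|z^{k-1}) \triangleq \frac{n_i(z^{k-1}) + \frac{1}{2}}{k - 1 + \frac{|\mathcal{X}|}{2}}, \qquad i \in \mathcal{X}$$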

The following Lemma shows that the per-symbol codelength assigned by the KT estimator to any individual 
sequence is close to its empirical entropy. 

Lemma 2 (KT redundancy [24]). For any individual sequence $z^n \in \mathcal{X}^n$,

$$-\frac{1}{n}\log p^{\mathrm{KT}}(z^n) \leq H_{\mathrm{emp}}(z^n) + \frac{(|\mathcal{X}| - 1)\log n + O(1)}{2n}$$
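
As an illustration of Lemma 2, the sketch below computes the KT codelength sequentially and compares the per-symbol codelength with the empirical entropy. It assumes the standard KT assignment $(n_i + 1/2)/(k - 1 + |\mathcal{X}|/2)$; the function names and the test sequence are hypothetical, and the constant in the $O(1)$ term of the Lemma is not taken from the text.

```python
import math
from collections import Counter


def kt_code_length(z, alphabet_size):
    """Total KT codelength -log2 p_KT(z^n) in bits, computed sequentially."""
    counts = [0] * alphabet_size
    bits = 0.0
    for past_len, symbol in enumerate(z):
        # Standard KT assignment for the next symbol given past_len past symbols.
        prob = (counts[symbol] + 0.5) / (past_len + alphabet_size / 2.0)
        bits -= math.log2(prob)
        counts[symbol] += 1
    return bits


def empirical_entropy(z):
    """Zero-order empirical entropy of a non-empty sequence, in bits."""
    n = len(z)
    return -sum((c / n) * math.log2(c / n) for c in Counter(z).values())


# Hypothetical binary sequence with one '1' in every four symbols.
example_z = [1 if k % 4 == 0 else 0 for k in range(1000)]
n = len(example_z)
per_symbol_excess = kt_code_length(example_z, 2) / n - empirical_entropy(example_z)
print(f"excess per symbol: {per_symbol_excess:.5f} bits, "
      f"(log2 n)/(2n) = {math.log2(n) / (2 * n):.5f}")
```

For a binary alphabet the excess is nonnegative and of order $(\log n)/(2n)$, in line with the Lemma.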



A sequential probability estimator is said to be a KT($b$) estimator if it can be obtained from a KT
estimator by updating the estimates at least once per $b$ symbols. Such an estimator is given by

$$p_k^{\mathrm{KT}(b)}(\cdot|z^{k-1}) \triangleq p_{i_k+1}^{\mathrm{KT}}(\cdot|z^{i_k})$$

where $\{i_k\}_{k=1}^\infty$ is a nondecreasing index sequence determining the positions where the KT estimates are
updated, thus satisfying $k - b \leq i_k \leq k - 1$. In the sequel, we will be interested in the excess redundancy
incurred when using a KT($b$) estimator in lieu of the KT estimator, and possibly when the estimator is
matched to a different individual sequence, i.e., the case of noisy observations.
Lemma 3 (Noisy KT($b$) excess redundancy). Let $p^{\mathrm{KT}(b)}$ be a KT($b$) estimator for some $b \in \mathbb{N}$. Then for
any pair of individual sequences $z^n, w^n \in \mathcal{X}^n$,

$$-\log\frac{p^{\mathrm{KT}(b)}(z^n\|w^n)}{p^{\mathrm{KT}}(z^n)} \leq 2|\mathcal{X}|\left(b + d(z^n, w^n) - 1\right)\log 2ne \tag{1}$$

where $d(\cdot, \cdot)$ is the Hamming distance operator.

Proof. See Appendix A. □

Let $z^n \in \mathcal{X}^n$, $b^n \in \{0,1\}^n$, and let $\sigma_k(b^n)$ be the index of the $k$th nonzero element in $b^n$. Define
$z^n|b^n \in \mathcal{X}^{n_1(b^n)}$ to be the vector whose $k$th element is $z_{\sigma_k(b^n)}$, i.e., $z^n|b^n$ is a sample of size $n_1(b^n)$ of
$z^n$, where sampling positions are determined by the '1's in $b^n$. The next Lemma bounds the probability
that the deviation (measured in the $\ell_\infty$ norm) between the empirical distribution of an r.v. sequence and
that of a fixed-size random sample without replacement from that sequence exceeds some threshold. It is
a direct consequence of a result by Hoeffding.

Lemma 4 (Sampling without replacement). Let $Z^n, B^n$ be a pair of statistically independent r.v. sequences,
where $Z^n$ takes values in a finite alphabet $\mathcal{X}$, and $B^n$ is uniformly distributed over the set
$\{b^n \in \{0, 1\}^n : n_1(b^n) = m\}$. Then for any $\tau > 0$,

$$\mathbb{P}\left(\|P_{\mathrm{emp}}(Z^n) - P_{\mathrm{emp}}(Z^n|B^n)\|_\infty > \tau\right) \leq 2|\mathcal{X}|\exp\left(-2m\tau^2\right) \tag{2}$$

Proof. See Appendix A. □
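
The bound in Lemma 4 can be sanity-checked numerically. The following Monte Carlo sketch fixes one (hypothetical) ternary sequence, draws samples of size $m$ without replacement, and compares the observed frequency of large $\ell_\infty$ deviations with $2|\mathcal{X}|\exp(-2m\tau^2)$. All parameter values and names are illustrative assumptions; a simulation of this kind illustrates the bound but does not prove it.

```python
import math
import random


def emp_dist(z, size):
    """Empirical distribution of a non-empty sequence over {0, ..., size-1}."""
    return [sum(1 for s in z if s == i) / len(z) for i in range(size)]


def sample_deviation(z, m, size, rng):
    """l_inf distance between P_emp(z) and P_emp of a size-m sample without replacement."""
    positions = rng.sample(range(len(z)), m)  # uniform over all m-subsets of positions
    sample = [z[i] for i in positions]
    p_full, p_samp = emp_dist(z, size), emp_dist(sample, size)
    return max(abs(a - b) for a, b in zip(p_full, p_samp))


rng = random.Random(0)  # fixed seed for reproducibility
size, n, m, tau = 3, 600, 200, 0.1
z = [rng.randrange(size) for _ in range(n)]  # one fixed sequence, independent of the sampling
trials = 2000
exceed = sum(sample_deviation(z, m, size, rng) > tau for _ in range(trials))
empirical_prob = exceed / trials
bound = 2 * size * math.exp(-2 * m * tau ** 2)
print(f"observed frequency = {empirical_prob:.4f}, bound = {bound:.4f}")
```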

The following Lemma is an analogue of Lemma 4 for causally independent sampling, and is a direct
consequence of the Azuma-Hoeffding inequality for bounded-difference martingales [26].

Lemma 5 (Causally independent sampling). Let $Z^n, B^n$ be a pair of r.v. sequences, where $Z^n$ takes
values in a finite alphabet $\mathcal{X}$. Suppose $B_{k+1} \sim \mathrm{Ber}(q)$ and is statistically independent of $(B^k, Z^{k+1})$ for
any $k \in (n)$. Then for any $\tau > 0$,

IP(||Pemp(^")-«(S")Pc.„p(^"iS")L>T) < 2\X\exp(^^'^^ (3) 

where

$$\alpha(b^n) \triangleq \frac{n_1(b^n)}{\mathbb{E}(n_1(B^n))}$$

Proof. See Appendix A. □

In the context of Lemma 5 above, $B^n$ is said to be an (i.i.d.) causal sampling sequence⁴ for $Z^n$. The
multiplication of the sample's empirical distribution by the factor $\alpha(\cdot)$ is referred to as $\alpha$-normalization.
Note that $\alpha(B^n)P_{\mathrm{emp}}(Z^n|B^n)$ is in fact the vector of symbol occurrences in $Z^n|B^n$, normalized by
$\mathbb{E}(n_1(B^n)) = nq$ instead of by $n_1(B^n)$. Moreover, $\alpha(B^n) \to 1$ almost surely (a.s.), hence this vector
converges a.s. to a probability distribution as $n \to \infty$.
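
The $\alpha$-normalized sample distribution described above is simply the vector of symbol counts inside the sample divided by $nq$. A minimal sketch, with hypothetical function and variable names, is:

```python
import random


def alpha_normalized_sample_distribution(z, b, q, size):
    """Symbol counts of z at the positions where b is 1, normalized by E[n_1(B^n)] = n*q."""
    n = len(z)
    counts = [0] * size
    for zk, bk in zip(z, b):
        if bk == 1:
            counts[zk] += 1
    return [c / (n * q) for c in counts]


# Small deterministic example: two of four positions are sampled, q = 1/2.
small = alpha_normalized_sample_distribution([0, 1, 1, 0], [1, 0, 1, 0], 0.5, 2)

# Larger seeded example with i.i.d. Ber(q) sampling positions (illustrative values).
rng = random.Random(0)
n, q = 5000, 0.2
z = [k % 2 for k in range(n)]                    # hypothetical binary sequence
b = [1 if rng.random() < q else 0 for _ in range(n)]
vec = alpha_normalized_sample_distribution(z, b, q, 2)
alpha = sum(b) / (n * q)                          # alpha(b^n) = n_1(b^n) / (n q)
print(vec, alpha)
```

The entries of the vector sum to $\alpha(b^n)$ rather than to one, so it is a probability distribution only in the limit.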



III Channel Model and Main Result

A (causal) channel over a common input and output finite alphabet $\mathcal{X}$ is a sequence of conditional
probability distributions $W = \{W_k(\cdot|x^k, y^{k-1})\}_{k=1}^\infty$ over $\mathcal{X}$, where $x^k \in \mathcal{X}^k$, $y^{k-1} \in \mathcal{X}^{k-1}$. Two sequences
of r.v.'s $(X^\infty, Y^\infty)$ taking values in $\mathcal{X}$ are said to be a pair of input/output sequences for the channel
respectively, if for any $k \in \mathbb{N}$, $x^k \in \mathcal{X}^k$, $y^k \in \mathcal{X}^k$,

$$P_{Y_k|X^k Y^{k-1}}(y_k|x^k, y^{k-1}) = W_k(y_k|x^k, y^{k-1})$$

⁴ The i.i.d. sequence $B^n$ is only causally independent of $Z^n$, but the two sequences may generally be dependent. For
instance, setting $Z_1$ constant and $Z_k = B_{k-1}$ satisfies the conditions of the Lemma.
We will find it convenient to model the channel's action on its input as the modulo-addition of a realized
noise sequence $Z^\infty$ corresponding to $(X^\infty, Y^\infty)$, implicitly defined by

$$Y_k = X_k + Z_k, \qquad k \in \mathbb{N} \tag{5}$$

In fact, a channel can be equivalently defined by the conditional distribution of the noise sequence given
past and present inputs and past outputs, $P_{Z_k|X^k Y^{k-1}}(z_k|x^k, y^{k-1})$. A channel $W$ is called modulo-additive
if the following Markov relation is satisfied for any pair of input/output sequences and any $k \in \mathbb{N}$,

$$Z_k \leftrightarrow X^{k-1}Y^{k-1} \leftrightarrow X_k \tag{6}$$

or equivalently, if $W_k(y_k + z|x_k + z, x^{k-1}, y^{k-1})$ is independent of $z \in \mathcal{X}$ for any $x^k \in \mathcal{X}^k$, $y^k \in \mathcal{X}^k$. Note that this
definition of a modulo-additive channel allows the noise sequence to depend on previous inputs and outputs
in a general way. The more restricted class of modulo-additive channels where the channel is completely
defined by the noise distribution itself is discussed below.
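
Relation (5) can be read in both directions: it generates the output from a given noise sequence, and it defines the realized noise sequence from any input/output pair. A minimal sketch over the alphabet $\{0, \ldots, |\mathcal{X}|-1\}$, with illustrative function names, is:

```python
def modulo_add_channel(x, z, size):
    """Channel outputs y_k = x_k + z_k (mod size) for input x and noise z."""
    return [(xk + zk) % size for xk, zk in zip(x, z)]


def realized_noise(x, y, size):
    """Realized noise z_k = y_k - x_k (mod size), defined for any input/output pair."""
    return [(yk - xk) % size for xk, yk in zip(x, y)]


# Hypothetical quaternary example.
x = [0, 1, 2, 3]
z = [1, 3, 0, 2]
y = modulo_add_channel(x, z, 4)
print(y, realized_noise(x, y, 4))
```

The second function is what allows the noise-sequence view to be applied even when the underlying channel is not modulo-additive.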

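The family of all causal channels over $\mathcal{X}$ is denoted by $\mathcal{C}_{\mathcal{X}}$, and the family of all modulo-additive channels
over $\mathcal{X}$ is denoted by $\mathcal{A}_{\mathcal{X}} \subset \mathcal{C}_{\mathcal{X}}$. The families $\mathcal{C}_{\mathcal{X}}$ and $\mathcal{A}_{\mathcal{X}}$ are broad, including also non-stationary and
non-ergodic channels. In particular, $\mathcal{C}_{\mathcal{X}}$ includes the following families of channels sometimes used for
modelling channel uncertainty [27]: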

• The Compound Memoryless Channel, which is a family of time-invariant memoryless channels, or in
our notation all channels for which (informally) $W_k(\cdot|x^k, y^{k-1}) = W(\cdot|x_k)$, where $W \in \mathcal{S}$ for some
set $\mathcal{S}$ of conditional probability distributions over $\mathcal{X}$. Specifically, $\mathcal{A}_{\mathcal{X}}$ includes compound channels
for which $\mathcal{S}$ consists only of (memoryless) modulo-additive mappings.

• The Arbitrarily Varying Channel (AVC), which is a family of time-varying memoryless channels, or
in our notation all channels for which (informally) $W_k(\cdot|x^k, y^{k-1}) = W_k(\cdot|x_k)$, where each $W_k(\cdot|\cdot) \in \mathcal{S}$
for some set $\mathcal{S}$ (state space) of conditional probability distributions over $\mathcal{X}$. Alternatively, an AVC
can also be defined via the noise sequence by requiring $Z_k \leftrightarrow X_k \leftrightarrow X^{k-1}Y^{k-1}$. Once again, $\mathcal{A}_{\mathcal{X}}$
includes all AVCs for which $\mathcal{S}$ consists only of (memoryless) modulo-additive mappings.

• Noise Sequence Channels: This family is denoted by $\mathcal{N}_{\mathcal{X}}$, and consists of all (modulo-additive)
channels that are completely defined by the noise sequence itself, i.e., for which the (stricter) Markov
relation

$$Z_k \leftrightarrow Z^{k-1} \leftrightarrow X^k Y^{k-1} \tag{7}$$

holds for any $k \in \mathbb{N}$. Note that some texts use noise sequence channels as the standard definition
for a modulo-additive channel. Our definition for a modulo-additive channel is broader, allowing a
general coupling between the noise sequence and previous inputs/outputs, and so $\mathcal{N}_{\mathcal{X}} \subset \mathcal{A}_{\mathcal{X}}$ with
the inclusion being strict.

• Individual Noise Sequence Channels: This family consists of all noise sequence channels for which
the noise sequence is an individual sequence $Z^\infty = z^\infty$, i.e., $W_k(y_k|x^k, y^{k-1}) = \delta_{y_k, x_k + z_k}$ where $\delta_{ij}$
is Kronecker's delta. It is a subfamily of $\mathcal{N}_{\mathcal{X}}$ defined above, and may be viewed as an AVC with
$\mathcal{S}$ being the set of all deterministic modulo-additive mappings. The example of the binary channel
given in the introduction falls into this category.

• Causal Adversarial Channels: Loosely speaking, a causal adversarial channel is one for which at
each time point an adversary (jammer) chooses a (possibly random) input-output mapping according
to some (possibly random) strategy, that may arbitrarily depend on previous channel inputs and
outputs. It is easy to see that the family of causal adversarial channels is in fact equivalent to
$\mathcal{C}_{\mathcal{X}}$, since any strategy employed by the adversary can be equivalently described by the sequence
$\{W_k(\cdot|x^k, y^{k-1})\}_{k=1}^\infty$. If the adversary is limited to use only modulo-additive mapping strategies,
then this is equivalent to the family $\mathcal{A}_{\mathcal{X}}$. In the sequel, we will sometimes find it convenient to use
the adversarial point of view.

The communications problem with feedback over a channel $W \in \mathcal{C}_{\mathcal{X}}$ is now described. Without loss
of generality, we assume a transmitter is in possession of a message point $\theta_0 \in [0, 1)$, its binary expansion
representing an infinite bit string to be reliably conveyed to a receiver over the channel $W$ (later assumed
to be unknown). A (sequential) feedback transmission scheme is described by a triplet $(G, S, \Delta)$, where
$G = \{g_k : [0, 1) \times \mathcal{X}^{k-1} \mapsto \mathcal{X}\}_{k=1}^\infty$ is a sequence of transmission functions, $S \in \mathcal{C}_{\mathcal{X}}$ is a feedback strategy,
$\Delta = \{\Delta_k : \mathcal{X}^k \times \mathcal{X}^{k-1} \mapsto \mathcal{I}\}_{k=1}^\infty$ is a sequence of decoding rules, and $\mathcal{I}$ is the set of all binary subintervals of
the unit interval⁵. A scheme is said to use passive feedback if $S$ consists of only deterministic conditional
distributions, and is otherwise said to use active feedback. A scheme is said to use asymptotically passive
feedback if the portion of non-deterministic conditional distributions within the first $n$ elements of $S$ tends
to zero with $n$.

Figure 1: Channel model and a feedback transmission scheme

A feedback transmission scheme $(G, S, \Delta)$ used over the channel $W \in \mathcal{C}_{\mathcal{X}}$ with a message point $\theta_0 \in [0, 1)$
is described by the following construction, also depicted in Figure 1:

• $(X^\infty, Y^\infty)$ is an input/output pair for the channel $W$. The channel input sequence is said to be
generated by the transmitter, while the channel output sequence is said to be observed by the receiver.

• $(Y^\infty, U^\infty)$ is an input/output pair for the feedback strategy $S$.

• The channel input sequence is generated by the transmitter for any $k \in \mathbb{N}$, as follows:

$$X_k = g_k(\theta_0, U^{k-1}) \tag{8}$$

The existence of an instantaneous noiseless feedback link is manifested through the fact that the
feedback sequence $U^\infty$, which is causally generated from $Y^\infty$ by the receiver via the feedback strategy
$S$, is causally available to the transmitter. Note that passive feedback means that $U_k$ is a deterministic
function of $Y^k$, with the most common example being when the channel output is fed back to the
transmitter, i.e., $U_k = Y_k$.

• The following Markov relation is satisfied for any $k \in \mathbb{N}$:

$$Y_k \leftrightarrow X^k Y^{k-1} \leftrightarrow U^{k-1} \tag{9}$$

Loosely speaking, this relation guarantees that any randomness generated by the receiver (and shared
with the transmitter via feedback) is "private", i.e., the channel/adversary has no direct access to it
and its actions are based on observing channel inputs/outputs only.

• $\Delta_k(Y^k, U^{k-1})$ is the receiver's decoded interval at time $k$.

The construction above uniquely determines the joint distribution of $(X^\infty, Y^\infty, U^\infty)$. If transmission is
terminated at time $n$, the receiver decodes the bits that correspond to the decoded interval $\Delta_n(Y^n, U^{n-1})$ as
being the leading bits in the message point's binary expansion. Accordingly, the associated rate and
(pointwise) error probability at time $n$ are defined as

$$R_n(W, \theta_0) = -\frac{1}{n}\log|\Delta_n(Y^n, U^{n-1})|, \qquad P_e(n, W, \theta_0) = \mathbb{P}\left(\theta_0 \notin \Delta_n(Y^n, U^{n-1})\right) \tag{10}$$



⁵ There is a one-to-one correspondence between any finite binary string $b_1 b_2 \cdots b_k$ and a binary subinterval $[\alpha, \beta) \subseteq [0, 1)$
where $\alpha = 0.b_1 b_2 \cdots b_k$ and $\beta = \alpha + 2^{-k}$ (similar to arithmetic coding).
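
The correspondence between decoded bit strings and binary subintervals, and the resulting rate in (10), can be sketched as follows. Exact rational arithmetic is used to avoid rounding; the function names are illustrative assumptions.

```python
import math
from fractions import Fraction


def bits_to_interval(bits: str):
    """Map a binary string b_1...b_k to the subinterval [alpha, alpha + 2^-k) of [0, 1)."""
    k = len(bits)
    alpha = Fraction(int(bits, 2), 2 ** k) if k > 0 else Fraction(0)
    return alpha, alpha + Fraction(1, 2 ** k)


def instantaneous_rate(bits: str, n: int) -> float:
    """Rate -(1/n) log2 |interval| in bits per channel use after n uses."""
    alpha, beta = bits_to_interval(bits)
    return -math.log2(float(beta - alpha)) / n


def decoding_error(theta0: Fraction, bits: str) -> bool:
    """True if the message point lies outside the decoded interval."""
    alpha, beta = bits_to_interval(bits)
    return not (alpha <= theta0 < beta)


# Hypothetical example: three bits decoded after six channel uses.
interval = bits_to_interval("101")
rate = instantaneous_rate("101", 6)
print(interval, rate)
```

Decoding $k$ bits corresponds to an interval of length $2^{-k}$, so the rate is simply $k/n$.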
Modelling the channel actions as the modulo-addition of a realized noise sequence as in (5) allows us
to define the (modulo-additive, zero-order) empirical capacity at time $n$ as

$$C_n^{\mathrm{emp}}(W, \theta_0) = \log|\mathcal{X}| - H_{\mathrm{emp}}(Z^n) \tag{11}$$

where the r.v. $H_{\mathrm{emp}}(Z^n)$ is the zero-order empirical entropy of $Z^n$. Hence, the empirical capacity is
the capacity of a corresponding memoryless modulo-additive channel, with a marginal noise distribution
that coincides with the empirical distribution of $Z^n$. In general, both the instantaneous rate and the
empirical capacity are r.v.'s with distributions that depend on the channel, the message point and even the
transmission scheme itself (the latter dependency is suppressed). Note however that in the special case
where communications take place over a noise sequence channel $W \in \mathcal{N}_{\mathcal{X}}$, the empirical capacity depends
only on the channel, and for an individual noise sequence channel, it is deterministic.

The universal communication problem over a family of channels $\mathcal{F} \subseteq \mathcal{C}_{\mathcal{X}}$ is now described. Suppose
a feedback transmission scheme $(G, S, \Delta)$ is used for communication over an unknown channel $W \in \mathcal{F}$.
Regarding the empirical capacity as a measure for how well the channel behaves, a desirable property would
be for the scheme, although being fixed and independent of the actual channel in use, to achieve rates close
to the empirical capacity with a low error probability. Making this notion precise, a scheme $(G, S, \Delta)$ is
said to (uniformly) achieve the empirical capacity over the family $\mathcal{F}$, if

$$\sup_{W \in \mathcal{F},\, \theta_0 \in [0,1)} P_e(n, W, \theta_0) \leq \varepsilon_1(n)$$

$$\inf_{W \in \mathcal{F},\, \theta_0 \in [0,1)} \mathbb{P}\left(R_n(W, \theta_0) > C_n^{\mathrm{emp}}(W, \theta_0) - \varepsilon_2(n)\right) \geq 1 - \varepsilon_3(n) \tag{12}$$

where all $\varepsilon_1(n), \varepsilon_2(n), \varepsilon_3(n) \to 0$. Such a scheme is also called universal for the family $\mathcal{F}$.

In the discussion so far we have considered horizon-free transmission schemes, namely schemes that do
not depend on any decoding deadline and can be terminated at any time. In the sequel, we also consider
finite-horizon schemes, which are schemes that must terminate at some given time $n$ (horizon). The
horizon-free construction and the subsequent definitions of rate and error probability immediately carry
over to the finite-horizon setting, via simple truncation. A sequence $\{(G, S, \Delta)_n\}_{n=1}^\infty$ of finite-horizon
transmission schemes, with $(G, S, \Delta)_n$ having a horizon $n$, is said to achieve the empirical capacity over a
family $\mathcal{F}$ if for any $n \in \mathbb{N}$ the scheme $(G, S, \Delta)_n$ satisfies (12), and $\varepsilon_1(n), \varepsilon_2(n), \varepsilon_3(n) \to 0$. A finite-horizon
scheme is loosely said to achieve the empirical capacity whenever it is clear that a suitable sequence of
such schemes with an arbitrarily large horizon can be constructed. We now state our main result:

Theorem 1. There exists a horizon-free feedback transmission scheme $(G, S, \Delta)$ using asymptotically
passive feedback, that achieves the empirical capacity over the family $\mathcal{A}_{\mathcal{X}}$. Such a universal scheme is
constructed explicitly below. Furthermore, the scheme can be adapted to achieve the empirical capacity over
the larger family $\mathcal{C}_{\mathcal{X}}$, if common randomness is available.

Proof. The rest of the paper is dedicated to the construction of the universal scheme and hence to the
proof of the Theorem. The discussion in the body of the paper focuses on a finite-horizon feedback
transmission scheme, which is introduced in Section IV and shown to be universal for the family $\mathcal{A}_{\mathcal{X}}$ of
modulo-additive channels in Section V. The horizon-free variant of this scheme is discussed in Appendix
B, and the adaptations (via common randomness) required to obtain universality for the family $\mathcal{C}_{\mathcal{X}}$ of all
causal channels are relegated to Appendix C. □

The following remarks are now in order:

1) The probabilities in (10) and (12) are taken over the randomness created both by the feedback strategy
and by the channel. The randomness due to feedback is negligible yet essential, as manifested by
the special case of an individual noise sequence channel, where without randomness one is limited by
Berlekamp's results [15] and the empirical capacity cannot be attained, even if it is known in advance.
Note also that the definition in (12) requires uniform convergence over the message point. This is the
variable-rate counterpart of a maximum error probability criterion, and from an adversarial viewpoint
is equivalent to the assumption that the adversary knows the message point.
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2) As already mentioned, when communicating over an unknown member of or 'ffx no rate can be 
guaranteed in advance since both families include (many) channels with zero capacity. Furthermore, 
since the channel law may vary arbitrarily there is no hope to identify the actual channel in use and 
attain its capacity, even with feedback. Our approach is more "optimistic" : We disregard the complex 
nature oi '^x and model the channel actions as being memoryless modulo-additivc, although these 
are usually not. This simple model allows us to opportunistically attain rates that correspond to 
the empirical goodness of the realized channel (measured relative to our model), no matter what the 
true channel law is. In the special case of noise sequence channels, we obtain universality w.r.t. any 
competing scheme that is informed of the empirical distribution of the noise sequence in advance. Of 
course, more complex models for channel actions can be considered. For instance, one can model the 
actions as being modulo-additive with some Markovian statistical dependence, or as being memoryless 
but input dependent. These more elaborate models allow to universally approach suitably defined (and 
possibly higher) empirical capacities, and are discussed in part II of this work [22j . 

3) The empirical capacity over the family j^x is achieved essentially without common randomness, since 
the negligible amount nevertheless required can be generated via feedback. However, when communi- 
cating over the family '^x we require common randomness that cannot be accommodated by feedback. 
As described in Appendix [Cl this randomness is chiefly used for dithering, i.e., making the input distri- 
bution uniform. This is not merely an artifact, but has to do with the fact that the empirical capacity 
is defined in terms of a noise sequence, and for channels in 'lox the empirical distribution of the realized 
noise sequence depends on the empirical input distribution, a dependence which a modulo-additive 
model cannot capture. For instance, consider the extreme case of a binary channel where the channel's 
output at each time point is randomly chosen in an i.i.d. fashion to be ~ Ber(e), independently of the 
inputs. This memoryless channel is in "^{0,1} (but not in ^{0,1}), and its capacity is of course zero. 
Suppose one tries to communicate over this channel using some transmission scheme, and at the end 
of transmission the empirical distribution of the inputs turns out to be q. Then with high probability, 
the empirical distribution of the realized noise sequence will be close to g * e = q{\ — e) + (1 — q)e, and 
the empirical capacity will therefore be close to 1 — /lij (<? * e) which is positive for \. This example 
demonstrates that for the family 'lox , unless the input distribution is guaranteed to be close to uniform 
with high probability, the empirical capacity as defined may not be the right quantity to look at. 

4) When a channel $W$ in the modulo-additive family happens to be memoryless, the empirical capacity
converges a.s. to the classical capacity of the channel. This is of course generally untrue for
memoryless channels $W \in \mathcal{C}_\mathcal{X}$ (with dithering). It is straightforward that, due
to the modulo-additive modelling, the empirical capacity cannot exceed the mutual information of $W$
with a uniform input, but in fact the penalty may be even larger. For example, consider a general
binary memoryless channel $W \in \mathcal{C}_{\{0,1\}}$ described by

$$W_k(y_k = j \mid x_k = i, x^{k-1}, y^{k-1}) = p_{ij}, \qquad i,j \in \{0,1\}$$

With a uniform input (obtained via dithering), the empirical capacity will converge a.s. to
$1 - h_B\left(\frac{1}{2}(p_{00} + p_{11})\right)$. This quantity is the capacity of a BSC obtained by
averaging the channel $W$ with its "cyclically shifted" counterpart, which is a binary memoryless
channel characterized by the transition probabilities $q_{ij} = p_{i+1,j+1}$ (modulo addition). By the
convexity of the mutual information in the transition matrix, and due to the symmetry between $W$ and
its cyclically shifted counterpart, the capacity of this BSC is upper bounded by the mutual information
of $W$ with a uniform input. Furthermore, unless $W$ happens to be a BSC to begin with, this inequality
is strict and the empirical capacity is a.s. strictly smaller than the mutual information of $W$ with a
uniform input. The discussion is easily extended to larger alphabets. This point, and related issues
mentioned in the previous remark, are further pursued in part II of this work [22].
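
The gap described in remark 4 can be illustrated with a short computation. This is a sketch under stated assumptions: entropies are in bits, the channel is parameterized by $p_{00} = P(Y=0|X=0)$ and $p_{11} = P(Y=1|X=1)$, and the function names are illustrative only.

```python
import math


def h_b(p: float) -> float:
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def uniform_input_mutual_information(p00: float, p11: float) -> float:
    """I(X;Y) in bits for a binary memoryless channel with a uniform input."""
    prob_y1 = 0.5 * (1.0 - p00) + 0.5 * p11
    return h_b(prob_y1) - 0.5 * h_b(p00) - 0.5 * h_b(p11)


def dithered_empirical_capacity(p00: float, p11: float) -> float:
    """Limit 1 - h_B((p00 + p11) / 2) of the empirical capacity under dithering."""
    return 1.0 - h_b(0.5 * (p00 + p11))


if __name__ == "__main__":
    for p00, p11 in ((0.9, 0.9), (1.0, 0.7)):
        print(p00, p11,
              round(uniform_input_mutual_information(p00, p11), 4),
              round(dithered_empirical_capacity(p00, p11), 4))
```

For a BSC ($p_{00} = p_{11}$) the two quantities coincide, while for an asymmetric channel such as the Z-channel with $p_{00} = 1$, $p_{11} = 0.7$ the dithered empirical capacity is strictly smaller.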

IV The Universal Scheme 

In this section we introduce a finite-horizon transmission scheme achieving the empirical capacity over the
modulo-additive family. We find it instructive to focus our discussion on this setting, as it is simpler yet
includes all the core ideas. The more exhaustive horizon-free scheme and its extension to the larger family
$\mathcal{C}_\mathcal{X}$ using common randomness are discussed in Appendices B and C. We start by building
intuition for the binary alphabet case, followed by a step-by-step construction of a binary alphabet universal
scheme. This scheme is then generalized to a finite alphabet setting via some simple modifications. The rate
and error probability analysis of the scheme appears in Section V. In this section, transmission is assumed to
take place over a fixed period of $n$ channel uses.



IV.1 The Horstein Scheme for the BSC

We first discuss the simple case where the channel in use is known to be a BSC with a given crossover
probability $p$, which in our terminology means a noise sequence channel (i.e., one satisfying the Markov
relation (7)) with an i.i.d. $\sim \mathrm{Ber}(p)$ noise sequence $Z^\infty$. For this setting, we describe the well known passive
feedback transmission scheme proposed by Horstein jj. In that scheme, the message point is assumed to 
be selected at random uniformly over the unit interval. The receiver constantly calculates the a-posteriori
probability distribution of the message point given the bits it has seen so far. These bits are passively fed
back to the transmitter (namely $U_k = Y_k$ in our terminology), which can therefore calculate the posterior
as well. A zero or one is transmitted according to whether the message point currently lies to the left or
to the right of the posterior's median point. Thus the transmitter always answers the most informative
yes/no question that can be posed by the receiver.

Specifically, let $\Theta_0$ be the random message point and denote its posterior density at time $k$ (given the
observed outputs) by $f_k(\theta) = f_{\Theta_0|Y^k}(\theta|y^k)$ for $\theta \in [0,1)$. Denote the median point
corresponding to $f_k(\theta)$ by $\mu_k$. Since $\Theta_0$ is uniform over the unit interval, we have
$f_0(\theta) = \mathbb{1}_{[0,1)}(\theta)$ and $\mu_0 = \frac{1}{2}$. The transmission functions are hence given by

$$g_k(\theta_0, y^{k-1}) = \begin{cases} 0 & \theta_0 < \mu_{k-1} \\ 1 & \text{otherwise} \end{cases}$$

and the transition from $f_k(\theta)$ to $f_{k+1}(\theta)$ is given by:

$$f_{k+1}(\theta) = \begin{cases} 2\left(p\,y_{k+1} + q(1 - y_{k+1})\right) f_k(\theta) & \theta < \mu_k \\ 2\left(q\,y_{k+1} + p(1 - y_{k+1})\right) f_k(\theta) & \theta > \mu_k \end{cases}$$

where $q = 1 - p$. The transition from $f_k(\theta)$ to $f_{k+1}(\theta)$ and the corresponding transmission of
$X_k = g_k(\theta_0, y^{k-1})$ are referred to in the sequel as a Horstein iteration. Several optimal decoding rules are
associated with the Horstein scheme. A fixed rate $R$ rule is to decode the binary interval of size $2^{-\lfloor nR \rfloor}$
with the maximal posterior probability, which in our notation reads

$$\Delta_n(y^n, y^{n-1}) = \Delta_n(y^n) = \arg\sup_{I\ \text{binary},\ |I| = 2^{-\lfloor nR \rfloor}} \int_I f_{\Theta_0|Y^n}(\theta|y^n)\, d\theta$$

A variable rate rule with a target error probability $p_e$ is to decode the smallest binary interval with a
posterior probability exceeding a threshold $1 - p_e$. There is also the bit-level decoding rule in which a
bit is decoded whenever its corresponding binary interval has accumulated a posterior probability greater
than $1 - p_e$, where $p_e$ is a target probability of bit error§. The Horstein scheme has long been conjectured
to achieve the capacity of the BSC with either a fixed or a variable rate decoding rule, but this fact was
rigorously proved only recently [28].
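
The Horstein iteration described above can be simulated directly. The following is a minimal sketch, not the authors' implementation: it assumes a piecewise-constant posterior stored as breakpoints and densities, uses exact rational arithmetic so that the median and the density values are computed without rounding, and all function names are illustrative.

```python
from fractions import Fraction
import random


def horstein_run(theta0, noise, p):
    """Run Horstein iterations over a binary modulo-additive channel.

    theta0 : message point in [0, 1) (Fraction)
    noise  : iterable of noise bits z_k, so that y_k = x_k XOR z_k
    p      : crossover probability assumed by both terminals, Fraction in (0, 1)
    Returns (pts, dens): the density equals dens[i] on [pts[i], pts[i + 1]).
    """
    q = 1 - p
    half = Fraction(1, 2)
    pts = [Fraction(0), Fraction(1)]
    dens = [Fraction(1)]
    for z in noise:
        # Locate the interval that contains the posterior median.
        acc = Fraction(0)
        for i, d in enumerate(dens):
            mass = d * (pts[i + 1] - pts[i])
            if acc + mass >= half:
                break
            acc += mass
        median = pts[i] + (half - acc) / d
        if median < pts[i + 1]:          # split that interval at the median
            pts.insert(i + 1, median)
            dens.insert(i + 1, d)
        split = i + 1                    # intervals [split:] lie right of the median
        x = 0 if theta0 < median else 1  # transmitted bit
        y = x ^ z                        # received bit, passively fed back
        left = 2 * (p * y + q * (1 - y))
        right = 2 * (q * y + p * (1 - y))
        dens = [v * left for v in dens[:split]] + [v * right for v in dens[split:]]
    return pts, dens


def density_at(pts, dens, theta):
    """Posterior density at the point theta."""
    for i, d in enumerate(dens):
        if pts[i] <= theta < pts[i + 1]:
            return d
    raise ValueError("theta must lie in [0, 1)")


if __name__ == "__main__":
    rng = random.Random(0)
    n, p = 40, Fraction(1, 10)
    noise = [1 if rng.random() < 0.1 else 0 for _ in range(n)]
    theta0 = Fraction(rng.getrandbits(32), 2**32)
    pts, dens = horstein_run(theta0, noise, p)
    print("number of constant intervals:", len(dens))
    print("density at message point:", float(density_at(pts, dens, theta0)))
```

Each iteration splits at most one interval, so after $n$ channel uses the density is constant over at most $n + 1$ intervals, and the density at the message point is multiplied by $2(1-p)$ or $2p$ depending on whether the noise bit is 0 or 1.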



IV.2 Binary Channels with Noise Constraints

Let us now take a step towards the unknown channel setting by considering a subfamily of the binary
modulo-additive family where the empirical distribution of the noise sequence is known in advance to a.s.
satisfy $p_{\rm emp}(Z^n) < p < \frac{1}{2}$ (e.g., an individual noise sequence with a fraction of '1's smaller
than $p$). From an adversarial point of view, this can be thought of as imposing a power constraint on the
adversary. A plausible idea would be to communicate by performing Horstein iterations using $p$ in lieu of the
crossover probability, hoping that the average performance of the scheme in the BSC setting will carry over
to this more stringent setting, i.e., make it possible to achieve $1 - h_B(p)$ uniformly over the
noise-constrained family. Unfortunately, this is not the case, since Berlekamp's results [15] imply that for
many values of $p$ there exist pairs of message points and individual noise sequences (satisfying the
constraint) for which decoding will surely fail. Nevertheless, as we now show, there is little information
missing at the receiver to get it right.

§For instance, when the posterior probability (w.r.t. $f_k(\theta)$) of either $[0, \frac{1}{2})$ or
$[\frac{1}{2}, 1)$ exceeds $1 - p_e$, the MSB of the message point is decoded as either 0 or 1, respectively.

We now make two key observations regarding the Horstein transmission process. First, we notice that
$f_k(\theta)$ is a quasi-constant function, over at most $k + 1$ distinct intervals whose union is the unit
interval. Second, when transmission is terminated after $n$ channel uses, we have

$$f_n(\theta = \theta_0) = 2^n (1-p)^{n(1 - p_{\rm emp}(Z^n))}\, p^{\,n\,p_{\rm emp}(Z^n)} = 2^{\,n\left(1 - h_B(p_{\rm emp}(Z^n)) - D(p_{\rm emp}(Z^n)\|p)\right)} \tag{13}$$

where $\theta_0$ is the message point. This stems directly from the fact that $\theta_0$ is always on the correct
side of the median, so that its density is multiplied by $2(1-p)$ when there is no error, and by $2p$ otherwise.
Now, let the message interval at time $k$ be the interval containing $\theta_0$ over which $f_k(\theta)$ is
constant, and let $2^{-\ell}$ be its length at the end of transmission, for some $\ell > 0$. Since the
probability mass of the message interval cannot exceed one, using (13) we have that

$$2^{-\ell} \cdot f_n(\theta = \theta_0) \le 1 \;\Rightarrow\; \ell \ge n\left(1 - h_B(p_{\rm emp}(Z^n)) - D(p_{\rm emp}(Z^n)\|p)\right)$$

Now, assume the decoder could identify with certainty which is the message interval at the end of
transmission. In that case, the common most significant bits in the binary expansion of points inside the
message interval (which also correspond to the message point itself) could be decoded, error free! This
means that an instantaneous decoding rate of

$$R_n = \frac{\lfloor \ell \rfloor}{n} \ge 1 - h_B(p_{\rm emp}(Z^n)) - D(p_{\rm emp}(Z^n)\|p) - \frac{1}{n} \tag{14}$$

could be attained. Actually, another information bit is required to allow the above rate, as the message
interval may sometimes be inconveniently located over the binary grid and not enough bits (if any) can be
decoded. We further elaborate on this point in subsection IV.4 when the full scheme is presented. Notice
that the expression in (14) can be divided into two parts: $1 - h_B(p_{\rm emp})$ is the empirical capacity of
the channel, and $-D(p_{\rm emp}\|p)$ is a penalty term for using the maximal value $p$ of $p_{\rm emp}$ instead
of $p_{\rm emp}$ itself. Note also that

$$\inf_{p_{\rm emp} < p} R_n \ge 1 - h_B(p) - \frac{1}{n} \tag{15}$$

and therefore the rate attained by this variable rate scheme (had the message interval been known at the
end of transmission) is guaranteed to be asymptotically no less than $1 - h_B(p)$, for any message point
$\theta_0 \in [0,1)$.
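
The decomposition of the rate bound into an empirical capacity term and a divergence penalty can be checked numerically. This is a minimal sketch assuming logarithms in base 2; the helper names are illustrative. It evaluates the right-hand side of (14) for several empirical noise fractions and compares it with the guaranteed value in (15).

```python
import math


def h_b(x: float) -> float:
    """Binary entropy in bits."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def kl_binary(a: float, b: float) -> float:
    """Divergence D(a||b) in bits between Ber(a) and Ber(b), for 0 < b < 1."""
    total = 0.0
    if a > 0.0:
        total += a * math.log2(a / b)
    if a < 1.0:
        total += (1.0 - a) * math.log2((1.0 - a) / (1.0 - b))
    return total


def rate_lower_bound(p_emp: float, p: float, n: int) -> float:
    """Right-hand side of (14): empirical capacity minus penalty minus 1/n."""
    return 1.0 - h_b(p_emp) - kl_binary(p_emp, p) - 1.0 / n


if __name__ == "__main__":
    p, n = 0.2, 1000
    for p_emp in (0.0, 0.05, 0.1, 0.2):
        print(f"p_emp = {p_emp}: bound = {rate_lower_bound(p_emp, p, n):.4f}")
    print(f"guaranteed: {1.0 - h_b(p) - 1.0 / n:.4f}")
```

Since $h_B(p_{\rm emp}) + D(p_{\rm emp}\|p) = -p_{\rm emp}\log p - (1-p_{\rm emp})\log(1-p)$ is linear and increasing in $p_{\rm emp}$ for $p < \frac{1}{2}$, the bound is smallest when $p_{\rm emp}$ approaches $p$, which is where (15) comes from.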

As observed before, there are at most $n + 1$ distinct intervals over which $f_n(\theta)$ is constant, therefore
no more than $\lceil \log(n+1) \rceil$ bits of side information are required in order to identify the message
interval at the end of transmission. This means that the decoding rate in (14) is achievable (error free) if only
$1 + \lceil \log(n+1) \rceil$ bits could be reliably conveyed to the receiver at the end of transmission. Thus, while
[15] determines that it is generally impossible to communicate at such a rate with no errors, the size of the
decoding uncertainty preventing us from attaining it is very small. For instance, if list decoding is allowed
then (15) could be attained using a list whose size grows only linearly (and not exponentially) with $n$.

In the following subsections we present a simple randomization technique by which these extra bits can 
be reliably conveyed to the receiver, with no asymptotic decrease in the data rate, so that the decoder can 
determine with high probability the correct message from the list. This technique requires a sub-linear 
number of random bits shared by the transmitter and the receiver (obtained via feedback, or possibly via 
a common random source). Moreover, through a sequential use of randomness we will be able to present a 
feasible transmission scheme that tracks the empirical distribution of the realized noise sequence, so that 
a significantly higher rate approaching the empirical capacity $1 - h_B(p_{\rm emp}(Z^n))$ is attained, avoiding the
penalty term in (14).

IV.3 Sequential Probability Estimation

We now turn to the general case where the channel in use is an arbitrary unknown member of the binary
modulo-additive family, i.e., from the adversarial point of view there are no constraints (besides causality)
on the way the noise sequence is generated by the adversary (e.g., the noise may be some unknown individual
sequence). A reasonable idea could be to plug a sequential estimator for the empirical distribution of the
noise $p_{\rm emp}(Z^n)$ into the Horstein iterations, that is, to let the "crossover probability" used by the
receiver to calculate the so-called "posterior distribution" of the message point vary with time.
Specifically, this amounts to

$$f_{k+1}(\theta) = \begin{cases} 2\left(p_{k+1}(Z^k)\,y_{k+1} + q_{k+1}(Z^k)(1 - y_{k+1})\right) f_k(\theta) & \theta < \mu_k \\ 2\left(q_{k+1}(Z^k)\,y_{k+1} + p_{k+1}(Z^k)(1 - y_{k+1})\right) f_k(\theta) & \theta > \mu_k \end{cases}$$

where $p_{k+1}$ is a sequential estimator applied to the noise sequence and $q_{k+1} = 1 - p_{k+1}$. Note that
$f_k(\theta)$ is still a probability density function, but loses the meaning of a true posterior in this unknown
channel setting. We henceforth loosely refer to $f_k(\theta)$ as the empirical posterior of the message point
(relative to the estimator in use).

This idea is of course problematic, since the noise sequence is causally known only to the transmitter
and not to the receiver, but for the moment let us assume that the estimates can be somehow made known
to the receiver, and take care of this point later. The first of the two key observations from the previous
subsection still holds, i.e., $f_k(\theta)$ is quasi-constant over $k + 1$ disjoint intervals whose union is the
unit interval. The empirical posterior evaluated at the message point at the end of transmission is now equal to

$$f_n(\theta = \theta_0) = 2^n \prod_{k=1}^{n} p_k^{Z_k}(1 - p_k)^{1 - Z_k} = 2^n p(Z^n)$$

where $p(Z^n)$ is the probability assigned to the entire noise sequence $Z^n$ by the estimator in use. Using this
fact and assuming again that at the end of transmission we know which one of the intervals is the message
interval (which is then set to be the decoded interval), the instantaneous decoding rate attained is given by

$$R_n \ge \frac{1}{n}\left\lfloor \log 2^n p(Z^n) \right\rfloor \ge 1 + \frac{1}{n}\log p(Z^n) - \frac{1}{n}$$

so the shorter the codelength assigned by the estimator to the noise sequence, or the more compressible the
strategy of the adversary is (w.r.t. a memoryless modulo-additive model), the higher the achieved rate. It
is therefore only reasonable to make use of the KT estimator, or more generally an intermittently updated
KT($b$) estimator. Applying Lemmas 2 and 3, the instantaneous decoding rate achieved when using a KT($b$)
estimator is

$$R_n \ge 1 + \frac{1}{n}\log p^{KT(b)}(Z^n) - \frac{1}{n} \ge 1 - h_B(p_{\rm emp}(Z^n)) - K_1\frac{b\log n}{n} = C_{\rm emp}(W, \theta_0) - K_1\frac{b\log n}{n} \tag{16}$$

where $K_1 > 0$ is a constant. Thus, if $b\,n^{-1}\log n = o(1)$ the empirical capacity is asymptotically attained.
This holds however only under the assumptions that the receiver knows the KT($b$) estimates online, and
can also recognize the message interval with certainty at the end of transmission.

There are two key elements that make this approach work. First, the update information required by
the receiver so that the assumptions above are satisfied can be made negligible, namely have rate zero. The
message interval is one of at most $n + 1$ possible intervals, hence requires only $\lceil \log(n+1) \rceil$ bits, and
using a KT($b$) estimator only requires communicating the number of '1's in the last $b$ channel uses, which
requires $\lceil \log b \rceil$ bits per $b$ channel uses and is negligible if $b^{-1}\log b = o(1)$. Note the core tradeoff
between a small $b$ required to obtain a small redundancy term in (16), and a large $b$ required to make the
update information rate negligible. Second, as we shall see, it is possible to obtain reliable zero rate
communications over an unknown member of the binary modulo-additive family as long as the empirical capacity
is not too small, and the latter condition can be identified with high probability. These two observations
allow us to turn the seemingly infeasible approach described so far into a practical scheme that achieves the
empirical capacity.
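
The link between codelength and rate can be made concrete with a small sketch of a KT-type sequential estimator. This is an illustration under assumptions, not the paper's exact KT($b$) definition (which is given in an earlier section): here the estimate of a '1' is $(n_1 + \frac{1}{2})/(n + 1)$ computed from the counts visible to the estimator, and those counts are refreshed only once every $b$ symbols. With $b = 1$ this is the standard KT estimator, whose codelength exceeds $n\,h_B(p_{\rm emp})$ by at most $\frac{1}{2}\log n + 1$ bits.

```python
import math


def empirical_entropy_bits(bits):
    """n * h_B(p_emp) for a binary sequence, in bits."""
    n, n1 = len(bits), sum(bits)
    total = 0.0
    for count in (n1, n - n1):
        if count > 0:
            total -= count * math.log2(count / n)
    return total


def kt_codelength(bits, b=1):
    """-log2 of the probability a block-updated KT estimator assigns to `bits`.

    The counts used for prediction are refreshed only every b symbols
    (an assumed form of intermittent updating; b = 1 is the standard KT estimator).
    """
    ones = total = 0            # counts visible to the estimator
    pend_ones = pend_total = 0  # counts accumulated since the last refresh
    length = 0.0
    for z in bits:
        p_one = (ones + 0.5) / (total + 1.0)
        length -= math.log2(p_one if z else 1.0 - p_one)
        pend_ones += z
        pend_total += 1
        if pend_total == b:
            ones += pend_ones
            total += pend_total
            pend_ones = pend_total = 0
    return length


if __name__ == "__main__":
    bits = [1 if i % 10 == 0 else 0 for i in range(1000)]
    n = len(bits)
    for b in (1, 50):
        rate = 1.0 - kt_codelength(bits, b) / n - 1.0 / n
        print(f"b = {b}: rate bound = {rate:.4f}")
    print(f"empirical capacity = {1.0 - empirical_entropy_bits(bits) / n:.4f}")
```

A larger $b$ delays the estimator and increases the redundancy, which is the tradeoff noted above against the cost of conveying the counts to the receiver.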



IV.4 A Universal Binary Alphabet Scheme

In this subsection we introduce the universal scheme achieving the empirical capacity for the binary
alphabet, finite-horizon case. Let us first provide a rough outline of the scheme. Transmission takes place
over a period of $n$ channel uses, which is divided into blocks of equal length $b = b(n)$. Inside each block,
Horstein iterations are performed over the majority of channel uses, always using the most updated KT
estimate. Update information containing the number of '1's in the previously accepted block, together
with the index of the current message interval, is coded using a repetition code and passed to the receiver
over randomly selected positions inside the block, which are selected via feedback. The idea is that since
positions are random, the "effective" channel for the update information transmission is roughly a BSC
with transition probabilities close to the empirical distribution of the noise sequence inside the block. This
distribution is estimated using a randomly positioned training sequence, and if the estimate is too close
to being uniform, the block is discarded. Otherwise, the update information can be reliably decoded with
high probability. Loosely speaking, the discarding process partitions the noise sequence into a "good" part
and a "bad" part, and with high probability the empirical capacity of the latter part is small. Therefore,
discarding the "bad" part increases the rate with high probability, due to the concavity of the entropy.

If a block is accepted then the polarity of the "effective" crossover probability (i.e., above/below $\frac{1}{2}$)
can be reliably determined, and hence the update information (which has a negligible rate) can be reliably
decoded. Once the update information is successfully decoded, the receiver uses the number of '1's in the
noise sequence from the previous block to update the KT estimate, which is then used for communications
in the next block. At the end of transmission, the last known message interval is set to be the decoded
interval.

What takes place inside each block is now described in detail. We define four types of positions within
the block: regular positions over which Horstein iterations are performed, training positions over which
a training sequence is transmitted, update positions over which update information is transmitted, and
active feedback positions used to select the random positions for the other types. The non-active feedback
positions (regular, training and update) are referred to as passive feedback positions. Apart from active
feedback positions, the receiver passively feeds back what it receives (i.e., $U_k = Y_k$ over these positions).

(A) Random positions generation (active feedback): We set a parameter $m = m(n)$ which will
indirectly determine the number of non-regular positions. The active feedback positions are always
at the beginning of the block, and occupy exactly $b_a = b_a(n)$ positions, where $b_a$ is determined in
the sequel as a function of $m, b$. The active positions are used in order to synchronize the terminals
regarding the type of each passive position that follows. The number of passive positions is fixed and
given by $b_p = b - b_a$. The type of each of the $b_p$ passive positions is determined by an i.i.d. sequence
$A^{b_p}$ over the alphabet {training, update, regular} with a marginal distribution given by
$\left(\frac{m}{b_p}, \frac{m}{b_p}, 1 - \frac{2m}{b_p}\right)$. The selection of $A^{b_p}$ is synchronized between
the transmitter and the receiver as follows:

(A1) The receiver randomly selects the sequence $A^{b_p}$ according to the i.i.d. distribution above. Let
the r.v.'s $(M_t, M_u, M_r)$ denote the number of occurrences of the corresponding symbols in $A^{b_p}$,
and note that $E(M_t, M_u, M_r) = (m, m, b_p - 2m)$.

(A2) The type* of the sequence $A^{b_p}$ is then binary encoded and sent via feedback over active positions,
which requires no more than $2\lceil \log b \rceil$ bits.

(A3) If both $\frac{m}{2} < M_t, M_u < 2m$ (not too few or too many training and update positions) then the
index of the sequence $A^{b_p}$ within its type* is communicated via the feedback (active positions),
which requires no more than $4m\lceil \log b \rceil$ bits. Otherwise, the transmitter overrides the receiver's
selection, and randomly selects the sequence $A^{b_p}$ itself.

Now, another sequence $\Gamma^{M_u}$ is selected, determining which of the update bits is to be transmitted
over which update position (with repetitions). The number of update bits (see step (C) below) is
$2\lceil \log(n+1) \rceil$, hence $\Gamma^{M_u}$ is selected in a uniform i.i.d. fashion over the alphabet
$[2\lceil \log(n+1) \rceil]$. If both $\frac{m}{2} < M_t, M_u < 2m$, then $\Gamma^{M_u}$ is selected by the receiver,
binary encoded using no more than $2m\lceil 1 + \log\lceil \log(n+1) \rceil \rceil$ bits and sent via feedback over
active positions. Otherwise, $\Gamma^{M_u}$ is selected by the transmitter.

Let us now assume that $b > \log(n+1)$, and set the total number of active positions to
$b_a = 8m\lceil \log b \rceil$, which is sufficient to accommodate the synchronization process described above.
If $b^{-1} m \log b = o(1)$, then this amount becomes negligible and the feedback strategy is asymptotically
passive. Figure 2 depicts a "typical" position assignment within a block.

*Note that by type of a sequence we refer to the vector of symbol occurrences, not to be confused with position types.

Figure 2: A block always begins with $b_a$ active feedback positions, followed by a bulk of Regular positions
that are randomly replaced with (on the average) $m$ Training positions and $m$ Update positions.

(B) Training transmission: A training sequence is transmitted over the $M_t$ random positions as
determined by $A^{b_p}$. At the end of the block, the receiver calculates the training estimate $p^{\rm train}$ for the
empirical distribution of the noise in the block, which is a coarse estimate later used for block
discarding. Let $Z^{b_p}$ denote the noise sequence over passive positions within the current block, i.e., if this
is the $k$th block then $Z^{b_p} = Z_{(k-1)b + b_a + 1}^{kb}$. Let $B^{b_p}$ be the corresponding training pattern
sequence, i.e., $B_k = \mathbb{1}_{\rm training}(A_k)$. The training estimate is set to

pt-- A „ (s^) p^^^^ (^z'n B'") = (^^) • p,„,p (Z^i S"'') (17) 

where the a-normalization factor is defined in ([¥]). 

(C) Update transmission: Update information is transmitted over the $M_u$ random positions determined
by $A^{b_p}$. The uncoded update information includes the following quantities, all binary encoded:

(C1) The number of '1's in the noise sequence over regular positions in the previously accepted block
($\lceil \log b \rceil$ bits).

(C2) The index of the message interval w.r.t. the interval partitioning of the empirical posterior at
the end of the previously accepted block ($\lceil \log(n+1) \rceil$ bits at the most).

(C3) One ambiguity resolving bit which is discussed later on.

The number of uncoded update bits in total is therefore no more than $2\lceil \log(n+1) \rceil$, and for simplicity
we assume that exactly $2\lceil \log(n+1) \rceil$ uncoded update bits are to be transmitted (zero padding
if necessary). Now, on the $k$th update position (determined by $A^{b_p}$) the transmitter sends the $\Gamma_k$-th
uncoded update bit. By properly tuning the scheme parameters we typically have $M_u \gg \log n$, hence
each update bit is coded using a repetition code with a random number of repetitions.

(D) Horstein iterations with KT($b$) estimates: Horstein iterations are performed over the (random)
$M_r$ regular positions as determined by $A^{b_p}$. The "crossover probability" used is the most updated KT
estimate of the empirical noise distribution available to the receiver. On the $k$th block, this estimate
is given by

$$\frac{\frac{1}{2} + \sum_{j=1}^{k-2} n_1(j)\, I(j)}{1 + \sum_{j=1}^{k-2} M_r(j)\, I(j)}$$

where $n_1(j)$ is the number of '1's the receiver assumes appeared in the noise sequence over regular
positions in the $j$th block, as communicated by the update information so far (may be different from
the actual number due to errors), $M_r(j)$ is the value of $M_r$ (number of regular positions) on the $j$th
block, and $I(j)$ is an indicator function that evaluates to one if the $j$th block was accepted, and to
zero if it was discarded. Note that the estimator works on the sequence of accepted positions, and is
always two accepted blocks behind.

(E) Block discarding: The block is discarded if either $M_t, M_u$ are out of range (in the sense of (A3)),
or if

$$\left\| p^{\rm train} - p_u \right\| < T_d \tag{18}$$

for some discarding threshold $T_d(n) = o(1)$, where $p_u$ is the uniform distribution over $\{0, 1\}$. Otherwise,
the block is accepted. When a block is discarded, the transmitter and receiver return to the state they
were in before the block started.

(F) Update information decoding: For an accepted block, the update information is decoded according
to the estimated noise probability, as follows. Let $Y^{b_p}$ denote the channel output sequence over
passive positions within the current block, i.e., if this is the $k$th block then
$Y^{b_p} = Y_{(k-1)b + b_a + 1}^{kb}$. Let $B^{b_p}_{(i)}$ be the repetition pattern sequence of the $i$th update bit
(as determined by $A^{b_p}, \Gamma^{M_u}$), i.e., a binary sequence with '1's only in update positions that
correspond to a repetition of that bit. For any $i \in [2\lceil \log(n+1) \rceil]$, the receiver calculates the
following update estimate for the $i$th bit:

where $\alpha$-normalization is used again. Intuitively, we expect the update estimate to be close to the
training estimate only when the corresponding update bit was a '0', unless the noise sequence within
the block is close to uniform, in which case it is likely to be discarded anyway. Accordingly, the decision
rule for the $i$th update bit is given by

$$\left\| p^{{\rm upd},i} - p^{\rm train} \right\| \;\underset{'0'}{\overset{'1'}{\gtrless}}\; T_u \tag{19}$$

for some update decision threshold $T_u(n) = o(1)$, where in case of an equality a '0' is decoded. The
decoded information is used to update the KT estimate, and to store the new identity of the message
interval.
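
Step (A) above can be sketched as follows. This is a minimal illustration under assumptions: logarithms are taken in base 2, the lower limit of the acceptable range for $M_t, M_u$ is taken to be $m/2$, only the receiver-side random draw is modeled (not the feedback encoding), and the function names are hypothetical.

```python
import math
import random
from collections import Counter

POSITION_TYPES = ("training", "update", "regular")


def draw_position_types(b: int, m: int, rng: random.Random):
    """Draw the i.i.d. type sequence for the passive positions of one block.

    b : block length, m : expected number of training (and of update) positions.
    Returns (types, counts), where counts maps each position type to M_t, M_u, M_r.
    """
    b_a = 8 * m * math.ceil(math.log2(b))  # active feedback positions (log base 2 assumed)
    b_p = b - b_a                          # passive positions
    if b_p <= 2 * m:
        raise ValueError("block too short for the requested m")
    weights = (m / b_p, m / b_p, 1.0 - 2.0 * m / b_p)
    types = rng.choices(POSITION_TYPES, weights=weights, k=b_p)
    return types, Counter(types)


def counts_in_range(m_t: int, m_u: int, m: int) -> bool:
    """Range check of step (A3); the lower limit m/2 is an assumption."""
    return m / 2 < m_t < 2 * m and m / 2 < m_u < 2 * m


if __name__ == "__main__":
    rng = random.Random(1)
    types, counts = draw_position_types(b=100_000, m=50, rng=rng)
    print(dict(counts), counts_in_range(counts["training"], counts["update"], 50))
```

The expected counts are $(m, m, b_p - 2m)$, so for moderately large $m$ the range check fails only rarely.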

Decoding Rule: Ideally, when transmission ends one would like to decode the minimal binary interval
(i.e., its corresponding MSBs) containing the last message interval given by the update information, which
(if no error has occurred) contains the message point. However, as happens in arithmetic coding, sometimes
this minimal binary interval is much larger than the message interval itself (for instance, if the message
interval contains the point $\frac{1}{2}$ then we cannot decode even a single bit). To solve this, note that it is
possible to divide the interval $[0, 1)$ into binary intervals of size corresponding to the message interval's
size, such that the message interval intersects no more than two of those, and the only uncertainty that may
be left is which one. The ambiguity resolving bit mentioned in step (C) above is used to that end. The receiver
thus uses the following decoding rule$^{8}$: Seek the two smallest adjacent binary intervals (of the same size)
whose union contains the last known message interval, and decode one of them according to the last known
ambiguity resolving bit. This rule guarantees that the decoded interval is less than twice the size of the
last message interval.
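
The decoding rule can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: intervals are half-open with exact rational endpoints, a binary interval of level $k$ is $[j2^{-k}, (j+1)2^{-k})$, and the convention that the ambiguity bit is 0 for the lower cell and 1 for the upper cell is an assumption.

```python
from fractions import Fraction
import math


def candidate_cells(a: Fraction, b: Fraction):
    """Return (k, j) with k maximal such that [a, b) lies inside the two adjacent
    binary intervals [j / 2^k, (j + 1) / 2^k) and [(j + 1) / 2^k, (j + 2) / 2^k)."""
    k = 0
    while True:
        scale = 2 ** (k + 1)
        lo = math.floor(a * scale)     # cell holding the left endpoint
        hi = math.ceil(b * scale) - 1  # cell holding points just below b
        if hi - lo > 1:                # more than two cells at the finer level
            break
        k += 1
    return k, math.floor(a * 2 ** k)


def ambiguity_bit(theta0: Fraction, a: Fraction, b: Fraction) -> int:
    """Transmitter side: 0 if the message point is in the lower cell, 1 otherwise."""
    k, j = candidate_cells(a, b)
    return 0 if theta0 < Fraction(j + 1, 2 ** k) else 1


def decode_interval(a: Fraction, b: Fraction, bit: int):
    """Receiver side: the binary interval selected by the ambiguity resolving bit."""
    k, j = candidate_cells(a, b)
    cell = j + bit
    return Fraction(cell, 2 ** k), Fraction(cell + 1, 2 ** k)


if __name__ == "__main__":
    a, b, theta0 = Fraction(7, 16), Fraction(9, 16), Fraction(17, 32)
    bit = ambiguity_bit(theta0, a, b)
    print(candidate_cells(a, b), bit, decode_interval(a, b, bit))
```

In the example the message interval $[\frac{7}{16}, \frac{9}{16})$ contains $\frac{1}{2}$, so no common MSB exists, yet with the ambiguity bit four bits can still be decoded.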

As we show in Section V, by properly selecting the dependence of the scheme parameters on $n$, the update
information is guaranteed to be correctly decoded with probability approaching one without causing any
asymptotic decrease in the data rate, thus allowing the empirical capacity to be approached.

IV.5 A Universal Finite Alphabet Scheme

We now describe the (finite-horizon) finite alphabet $\mathcal{X}$ variant of the universal scheme. Suppose the
transmitter and the receiver can agree on a sequential KT($b$) estimator $p^{KT(b)}(\cdot|Z^{k-1})$ for the noise
sequence at each time $k$ (as in subsection IV.3). Horstein iterations using this estimate are performed as
follows. The empirical posterior is initialized as before to a uniform distribution over the unit interval,
$f_0(\theta) = \mathbb{1}_{[0,1)}(\theta)$. At each time point $k$, the unit interval is divided into $|\mathcal{X}|$
consecutive subintervals with identical probability $|\mathcal{X}|^{-1}$ under the empirical posterior
$f_{k-1}(\theta)$, and the transmitter sends the symbol that corresponds to the subinterval containing the message
point $\theta_0$. Upon receiving $Y_k \in \mathcal{X}$, the receiver generates the new empirical posterior $f_k(\theta)$
by multiplying $f_{k-1}(\theta)$ in the interval corresponding to the symbol $i \in \mathcal{X}$ by the factor
$|\mathcal{X}|\,p^{KT(b)}(Y_k - i|Z^{k-1})$, where the minus sign is the modulo-subtraction operator$^{9}$. Thus the
message point is always in the interval multiplied by the estimate corresponding to the next value of the
noise, and hence we have

$$f_n(\theta = \theta_0) = |\mathcal{X}|^n\, p^{KT(b)}(Z^n)$$

where $p^{KT(b)}(Z^n)$ is the probability assigned to the entire noise sequence by the KT($b$) estimator. Similarly
to the binary case, there are no more than $n(|\mathcal{X}| - 1) + 1$ subintervals over which $f_n(\theta)$ is constant.
Assuming further that the message interval index is known to the receiver at the end of transmission, then
using Lemmas 2 and 3 for the KT($b$) redundancy, the achieved rate is given by

$$R_n \ge \log|\mathcal{X}| - H_{\rm emp}(Z^n) - K_2\frac{b\log n}{n} = C_{\rm emp}(W, \theta_0) - K_2\frac{b\log n}{n} \tag{20}$$

for some $K_2 > 0$. Once again, the update information required by the receiver for the above assumptions
to hold incurs a vanishing rate penalty, and can be reliably transmitted over random positions if the
empirical capacity is not too small.

$^{8}$If all blocks were discarded, the decoded interval is trivially taken to be $[0, 1)$.

$^{9}$The probability of the $i$th interval under $f_k(\theta)$ is exactly $p^{KT(b)}(Y_k - i|Z^{k-1})$, hence $f_k(\theta)$ is a probability distribution.

The finite alphabet scheme now follows a similar path to that of the binary alphabet scheme. The
transmission period is divided into blocks of equal size $b = b(n)$, and inside each block we again have the
active feedback, training, update and regular positions. However, the update information transmission and
decoding is somewhat more involved in this case. We now describe what happens inside each block:

(A) Random position generation (active feedback): We use the same parameters $b, m, b_a$, generate
the corresponding r.v.'s $M_t, M_u, M_r$, and pick the sequence $A^{b_p}$ the same way$^{10}$. However, the sequence
$\Gamma^{M_u}$ is now selected uniformly over an alphabet $[(|\mathcal{X}| - 1)s]$, where $s = s(n)$ corresponds to a
larger number of update bits, and is defined in step (C) below. Again, apart from active feedback positions
the receiver passively feeds back what it receives.

(B) Training transmission: The training estimate $p^{\rm train}$ is given by (17).

(C) Update transmission: Update information is transmitted over the $M_u$ random positions determined
by $A^{b_p}$. The uncoded update information includes the type (symbol occurrences) of the noise sequence
over regular positions in the previously accepted block, the index of the message interval w.r.t. the
interval partitioning of the empirical posterior at the end of that block, and one ambiguity resolving
symbol. Using a binary representation, the total number of uncoded update bits is no more than

$$\left\lceil \log b^{(|\mathcal{X}| - 1)} \right\rceil + \left\lceil \log\left(n(|\mathcal{X}| - 1) + 1\right) \right\rceil + 1 \le 2(|\mathcal{X}| - 1)\lceil \log(n+1) \rceil \triangleq s(n) \tag{21}$$

and again we zero pad the uncoded update bits up to the length $s$ above, for simplicity. For a non-binary
alphabet, using a random "repetition code" method similar to the one used in the binary
alphabet case may result in a decoding ambiguity$^{11}$, and thus a different method must be used. The
sequence $\Gamma^{M_u}$ takes values over an alphabet $[(|\mathcal{X}| - 1)s]$, so we can write for any $k$,
$\Gamma_k = i + js$ for some $i \in [s]$ and $j \in [|\mathcal{X}| - 1]$. Following this representation, in the $k$th
update position we send one of the symbols from the pair $\{0, j + 1\}$, where which one is determined by the
$i$th uncoded update bit$^{12}$. For a suitable selection of parameters, this procedure guarantees (with high
probability) that each of the uncoded update bits is sent several times using pairs of channel inputs with any
possible modulo-additive distance. This in turn guarantees that any bit is resolvable with high probability via
at least one of the pairs, unless the empirical capacity of that block is close to zero.

(D) Horstein iterations with KT($b$) estimates: Similar to the binary alphabet case. Horstein iterations
are performed over the (random) $M_r$ regular positions determined by $A^{b_p}$, using the most updated
KT estimate of the empirical noise distribution (over accepted blocks) available to the receiver.

(E) Block discarding: Same criterion as in the binary alphabet case, where now $p_u$ in (18) is taken
to be the uniform distribution over $\mathcal{X}$.

$^{10}$Note it is possible to reduce the number of active positions $b_a$ since the feedback has a larger capacity now, but this has
a negligible effect, and for consistency we refrain from doing so.

$^{11}$This occurs when the empirical distribution of the noise inside the block is invariant under some cyclic shift. Take for
example the distribution (0.4, 0.1, 0.4, 0.1) over a quaternary alphabet, in which case one cannot separate, say, the all '0's and
the all '2's repetition words, but the empirical capacity is nevertheless positive. Even in the simple modulo-additive DMC
setting with the above noise distribution, one would use only two inputs to attain capacity, say the first and the second.

$^{12}$We use '0' $\to \{0\}$, '1' $\to \{j + 1\}$.

(F) Update information decoding: For an accepted block, the update information is decoded using
$p^{\rm train}$ as follows. Let $B^{b_p}_{(i,j)}$ be the repetition pattern sequence of the $i$th update bit using the
inputs $\{0, j + 1\}$, as determined by $A^{b_p}, \Gamma^{M_u}$. For any $i \in [s]$ and $j \in [|\mathcal{X}| - 1]$, the
receiver calculates the following update estimate:

The decoding rule for the $i$th update bit is given by



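$$\max_{j \in [|\mathcal{X}| - 1]} \left\| p^{\rm train} - p^{\rm upd}_{(i,j)} \right\| \;\underset{'0'}{\overset{'1'}{\gtrless}}\; T_u \tag{22}$$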



where in cases of an equality a '0' is decoded. The decoded information is used to update the KT 
estimate, and to store the new identity of the message interval. 
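
The mapping from the random index $\Gamma_k$ to the transmitted update symbol in step (C) can be sketched as follows. This is a minimal illustration assuming zero-based indices, i.e., $i \in \{0, \ldots, s-1\}$ and $j \in \{0, \ldots, |\mathcal{X}|-2\}$; the function name is hypothetical.

```python
def update_symbol(gamma: int, update_bits, alphabet_size: int) -> int:
    """Channel input sent on an update position with index gamma = i + j * s.

    The i-th uncoded update bit selects a symbol from the pair {0, j + 1}:
    bit '0' maps to 0 and bit '1' maps to j + 1.
    """
    s = len(update_bits)
    if not 0 <= gamma < (alphabet_size - 1) * s:
        raise ValueError("gamma out of range")
    j, i = divmod(gamma, s)
    return j + 1 if update_bits[i] else 0


if __name__ == "__main__":
    bits = [1, 0, 1]
    # Quaternary alphabet: every bit can be paired with each nonzero symbol 1, 2, 3.
    print([update_symbol(g, bits, alphabet_size=4) for g in range(9)])
```

Because $j$ ranges over all values, each update bit is eventually repeated using every possible modulo-additive distance between the two inputs of a pair, which is what resolves the ambiguity described in the footnote on shift-invariant noise distributions.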

The decoding rule used is the same one as in the binary setting; see subsection IV.4. In the following
section, we analyze the performance of the described scheme, proving it is universal for the modulo-additive family.



V Analysis 

In this section we analyze the performance of the finite-alphabet finite-horizon scheme presented in subsection 
IV.5 and show it achieves the empirical capacity in the limit of infinite horizon, for a suitable selection 
of the parameters $m(n)$, $b(n)$, $\tau_d(n)$, $\tau_u(n)$. For brevity, the dependence of the parameters on $n$ will usually 
be omitted. In the sequel, we also show how faster convergence to the empirical capacity is obtained when 
operating over noise sequence channels, and discuss the amount of randomness generated by the scheme. 
The following observation plays a key role in our subsequent derivations: 

Lemma 6. For any specific block, let $Z^{b_p}$, $B^{b_p}$ and $B^{b_p}_{(i,j)}$ be the corresponding noise sequence over passive 
positions, training pattern sequence, and repetition pattern sequence for the ith update bit with the input 
pair $\{0, j+1\}$, respectively. Then $B^{b_p}$ and $B^{b_p}_{(i,j)}$ each constitute a causal sampling sequence for $Z^{b_p}$. 

Proof. See Appendix A. In short, the training/update positions are i.i.d. by construction, and causal 
independence is established by combining that with the Markov relations (6) and (9). □ 



V.l Error Probability 

The only source for error in our scheme lies in the incorrect decoding of update information in the last 
accepted block before transmission is terminated, which causes the wrong message interval to be decoded. 
However, for simplicity of the exposition, we leniently assume that an error is declared whenever the update 
information in any of the blocks is erroneously decoded. 

The error probability is hence upper bounded by the probability of erroneous update decoding in any 
of the blocks. Therefore, we now focus on a specific block and find the corresponding update decoding 
error probability, where it is emphasized that while discarding a block has an impact on the rate, it does 
not constitute an error event. The noise sequence over passive positions in the block is denoted as before 
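by $Z^{b_p}$. For any $i \in \langle s \rangle$ and $j \in \langle |\mathcal{X}|-1 \rangle$, define 

\[
p_{(i,j)} \triangleq a\big(B^{b_p}_{(i,j)}\big)\, p_{\mathrm{emp}}\big(Z^{b_p} \,\big|\, B^{b_p}_{(i,j)}\big) \tag{23}
\]

which is the counterpart of $p^{\mathrm{upd}}_{(i,j)}$, yet sampling the noise sequence rather than the output sequence over 
the update positions corresponding to the ith update bit and the input pair $\{0, j+1\}$. Define the following 
two events: 

\[
E_1 \triangleq \Big\{ \big\| p^{\mathrm{train}} - p_{\mathrm{emp}}(Z^{b_p}) \big\|_\infty > \tau \Big\}, \qquad
E_2 \triangleq \Big\{ \max_{i \in \langle s \rangle,\, j \in \langle |\mathcal{X}|-1 \rangle} \big\| p_{(i,j)} - p_{\mathrm{emp}}(Z^{b_p}) \big\|_\infty > \tau \Big\}
\]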



for some $\tau(n) = o(1)$. We assert that for a suitable selection of the thresholds $(\tau_d, \tau_u, \tau)$, a necessary 
condition for an update decoding error in the block is given by the event $E_1 \cup E_2$. To see why this holds, 
let us assume the complementary event $E_1^c \cap E_2^c$ and show it implies no update decoding errors for a 
suitable threshold selection. If the block was discarded then surely there is no error, so assume further 
the block was not discarded. Now consider the ith update bit. If this bit is a '0' then the channel input 
at the corresponding update positions (determined by $B^{b_p}_{(i,j)}$) is $0 \in \mathcal{X}$, and therefore $p^{\mathrm{upd}}_{(i,j)} = p_{(i,j)}$ for any 
$j \in \langle |\mathcal{X}|-1 \rangle$. Thus in this case we have 

\[
\big\| p^{\mathrm{upd}}_{(i,j)} - p^{\mathrm{train}} \big\|_\infty
= \big\| p_{(i,j)} - p^{\mathrm{train}} \big\|_\infty
\le \big\| p_{(i,j)} - p_{\mathrm{emp}}(Z^{b_p}) \big\|_\infty + \big\| p^{\mathrm{train}} - p_{\mathrm{emp}}(Z^{b_p}) \big\|_\infty
\le 2\tau
\]

which holds for any $j \in \langle |\mathcal{X}|-1 \rangle$. Therefore, if we set $2\tau \le \tau_u$ then the above together with the update 
decoding rule (22) imply that the ith update bit is correctly decoded. Now suppose the ith update bit is 
a '1', in which case the channel input at the corresponding update positions is $(j+1) \in \mathcal{X}$, and thus $p^{\mathrm{upd}}_{(i,j)}$ 
is a cyclic right-shift of $p_{(i,j)}$ by $j+1$ positions. Writing $p^{(-j)}_{\mathrm{emp}}(Z^{b_p})$ for a cyclic left-shift of $p_{\mathrm{emp}}(Z^{b_p})$ by 
$j+1$ positions, we have the following chain of inequalities: 



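\[
\begin{aligned}
\big\| p^{\mathrm{upd}}_{(i,j)} - p^{\mathrm{train}} \big\|_\infty
&\overset{(a)}{\ge} \big\| p^{\mathrm{upd}}_{(i,j)} - p_{\mathrm{emp}}(Z^{b_p}) \big\|_\infty - \big\| p^{\mathrm{train}} - p_{\mathrm{emp}}(Z^{b_p}) \big\|_\infty
\overset{(b)}{\ge} \big\| p^{\mathrm{upd}}_{(i,j)} - p_{\mathrm{emp}}(Z^{b_p}) \big\|_\infty - \tau \\
&\overset{(c)}{\ge} \big\| p_{(i,j)} - p^{(-j)}_{\mathrm{emp}}(Z^{b_p}) \big\|_\infty + \big\| p_{(i,j)} - p_{\mathrm{emp}}(Z^{b_p}) \big\|_\infty - 2\tau
\overset{(d)}{\ge} \big\| p^{(-j)}_{\mathrm{emp}}(Z^{b_p}) - p_{\mathrm{emp}}(Z^{b_p}) \big\|_\infty - 2\tau
\end{aligned}
\tag{24}
\]

In transition (a) we used the triangle inequality for the $\ell_\infty$ norm, transition (b) holds since we assume $E_1^c$, 
in (c) we use $E_2^c$, and also replace a cyclic right-shift of one vector with a corresponding cyclic left-shift of 
the other vector inside the first $\ell_\infty$ norm term. Finally, the triangle inequality is used once again in (d). 
We can now maximize both sides of (24) over $j \in \langle |\mathcal{X}|-1 \rangle$, to obtain 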



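\[
\begin{aligned}
\max_{j \in \langle |\mathcal{X}|-1 \rangle} \big\| p^{\mathrm{upd}}_{(i,j)} - p^{\mathrm{train}} \big\|_\infty
&\ge \max_{j \in \langle |\mathcal{X}|-1 \rangle} \big\| p^{(-j)}_{\mathrm{emp}}(Z^{b_p}) - p_{\mathrm{emp}}(Z^{b_p}) \big\|_\infty - 2\tau \\
&\overset{(a)}{=} \max\big(p_{\mathrm{emp}}(Z^{b_p})\big) - \min\big(p_{\mathrm{emp}}(Z^{b_p})\big) - 2\tau
\overset{(b)}{>} \big\| p_{\mathrm{emp}}(Z^{b_p}) - p_u \big\|_\infty - 2\tau
\end{aligned}
\tag{25}
\]

where $\max(\cdot)$, $\min(\cdot)$ return the maximal and minimal element of the vector argument, respectively. 
Transition (a) holds since the maximization is over the $\ell_\infty$ distance between a vector and all its cyclic shifts, 
and for (b) to hold with a strict inequality we further assume that $p_{\mathrm{emp}}(Z^{b_p})$ is not precisely uniform, 
which is satisfied by setting $\tau < \tau_d$, since for a block that was not discarded 

\[
\tau_d \le \big\| p^{\mathrm{train}} - p_u \big\|_\infty \le \big\| p_{\mathrm{emp}}(Z^{b_p}) - p_u \big\|_\infty + \tau
\]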



Finally, combining the above with (25) we obtain 

\[
\max_{j \in \langle |\mathcal{X}|-1 \rangle} \big\| p^{\mathrm{upd}}_{(i,j)} - p^{\mathrm{train}} \big\|_\infty > \tau_d - 3\tau
\]

If we set $\tau_d - 3\tau \ge \tau_u$ then the above together with the update decoding rule (22) imply that the ith update 
bit is correctly decoded in this case as well. Therefore, we now set 

\[
\tau_u = 2\tau, \qquad \tau_d = 5\tau \tag{26}
\]

and continue our analysis henceforth depending on the parameter $\tau = \tau(n)$. As we have just seen, this 
selection guarantees that the event $E_1 \cup E_2$ is indeed a necessary condition for an update decoding error 
within the block. 
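
The step behind transition (a) of (25), namely that the largest $\ell_\infty$ distance between a probability vector and its nontrivial cyclic shifts equals the gap between its largest and smallest entries, and that this gap strictly exceeds the $\ell_\infty$ distance from uniform for a non-uniform vector, can be checked numerically. The following is a small sanity-check sketch; the helper names are hypothetical, and the first test vector is the quaternary distribution mentioned in the footnote of the previous section.

```python
import math
from typing import List, Sequence


def cyclic_left_shift(p: Sequence[float], k: int) -> List[float]:
    """Cyclic left-shift of p by k positions."""
    k %= len(p)
    return list(p[k:]) + list(p[:k])


def max_shift_distance(p: Sequence[float]) -> float:
    """Largest l-infinity distance between p and any nontrivial cyclic shift."""
    return max(
        max(abs(a - b) for a, b in zip(p, cyclic_left_shift(p, k)))
        for k in range(1, len(p))
    )


def distance_from_uniform(p: Sequence[float]) -> float:
    """l-infinity distance between p and the uniform distribution."""
    u = 1.0 / len(p)
    return max(abs(a - u) for a in p)


for p in ([0.4, 0.1, 0.4, 0.1], [0.5, 0.2, 0.2, 0.1]):
    spread = max(p) - min(p)
    # The maximum over shifts matches max(p) - min(p) ...
    assert math.isclose(max_shift_distance(p), spread)
    # ... and is strictly larger than the distance from uniform.
    assert spread > distance_from_uniform(p)
```

For the first vector a shift by two positions gives distance zero, which is why the maximization over all input pairs is needed rather than a single fixed pair.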

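Let us now bound the probability of the event $E_1$. To that end, we note that Lemma 6 together with 
the $a$-normalization used in the definition of the training estimate, facilitate the use of Lemma 5. Since 
the training pattern sequence has a marginal distribution $\sim \mathrm{Ber}(q)$ with $q = m/b_p$ we obtain 

\[
P(E_1) = P\Big( \big\| p^{\mathrm{train}} - p_{\mathrm{emp}}(Z^{b_p}) \big\|_\infty > \tau \Big)
\le 2|\mathcal{X}| \exp\Big( -\frac{\tau^2 m^2}{2 b_p} \Big)
\le 2|\mathcal{X}| \exp\Big( -\tfrac{1}{2} \tau^2 m^2 b^{-1} \Big) \triangleq \varepsilon_1^{(1)}(n)
\]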
Bounding the probability of the event $E_2$ is similar, and Lemma 6 together with (23) facilitate the use 
of Lemma 5 for any of the repetition pattern sequences. These sequences all have a marginal distribution 
$\sim \mathrm{Ber}(q)$ with $q = \frac{m}{(|\mathcal{X}|-1) s\, b_p}$, where $s = 2(|\mathcal{X}|-1)\lceil \log(n+1) \rceil$ was given in (21). Using Lemma 5 and 
applying the union bound over all update bits and input pairs (i.e., all repetition pattern sequences) leads 
to 

\[
\begin{aligned}
P(E_2) &\le \sum_{i \in \langle s \rangle,\, j \in \langle |\mathcal{X}|-1 \rangle} P\Big( \big\| p_{(i,j)} - p_{\mathrm{emp}}(Z^{b_p}) \big\|_\infty > \tau \Big)
\le (|\mathcal{X}|-1) s \cdot 2|\mathcal{X}| \exp\Big( -\frac{\tau^2 m^2}{2 (|\mathcal{X}|-1)^2 s^2\, b_p} \Big) \\
&\le 4|\mathcal{X}| (|\mathcal{X}|-1)^2 \lceil \log(n+1) \rceil \exp\Big( -\frac{\tau^2 m^2 b^{-1}}{8 (|\mathcal{X}|-1)^4 \lceil \log(n+1) \rceil^2} \Big) \triangleq \varepsilon_1^{(2)}(n)
\end{aligned}
\]

So far, we have established that the probability of an update decoding error in any given block is upper 
bounded by $P(E_1 \cup E_2) \le \varepsilon_1^{(1)}(n) + \varepsilon_1^{(2)}(n)$. Using the union bound over the blocks and the fact that 
$\varepsilon_1^{(1)}(n), \varepsilon_1^{(2)}(n)$ do not depend on the message point or the channel, we obtain a uniform upper bound for 
the error probability achieved by the scheme: 

\[
\sup_{W,\, \theta_0} P_e(n, W, \theta_0) \le n b^{-1} \big( \varepsilon_1^{(1)}(n) + \varepsilon_1^{(2)}(n) \big) \triangleq \varepsilon_1(n) \tag{27}
\]

From the expression above it is easily verified that if $b(n), m(n), \tau(n)$ are selected such that $\tau^2 m^2 b^{-1} = 
\omega\big(\log^2 n \log(b^{-1} n \log n)\big)$, then $\varepsilon_1(n) \to 0$ and the error probability tends to zero uniformly as desired. 
However, since this is a variable rate scheme, a low probability of error is not enough: not making 
an error indicates that the decoded bits are correct, but it does not indicate how many bits were decoded. 



V.2 Rate 

Due to the inherent randomness generated by the transmission scheme and the possibly random actions 
of the channel, the rate achieved by the scheme is random. In this section we show that this rate is 
arbitrarily close to the empirical capacity of the channel, with probability that tends to one. 

At the first stage of the proof, we look only at regular positions (which are used for Horstein iterations), 
and analyze the rate w.r.t. these channel uses only. Later, we make the necessary adjustments taking into 
account the negligible effect of non-regular positions as well. For an accepted block, the number of regular 
positions is in the range $(b_{\min}, b_{\max})$, where $b_{\min} = b_p - 4m$, $b_{\max} = b_p - m$. Let $n^{\mathrm{reg}}$ be the (random) total 
number of regular positions over the entire transmission period, and let $\beta \in [0, 1]$ denote the (random) 
fraction of these positions that are accepted (namely, reside in accepted blocks). Communications (via 
Horstein iterations) take place only on accepted regular positions, namely over $\beta n^{\mathrm{reg}}$ channel uses. The 
KT estimates used by the decoder are updated at varying intervals, but these intervals do not exceed $2b_{\max}$ 
(measured relative to the sequence of accepted regular positions). Hence, the estimator used in effect is a 
KT($2b_{\max}$) estimator over a sequence of length $\beta n^{\mathrm{reg}}$. 

Define $V_0$ to be the event where no update decoding errors have occurred, and let $V_1$ be the event 
where none of the blocks were discarded due to a too small or too large selection of $M_t, M_u$ made by the 
receiver. Given $V_0$, the KT estimates are based on noiseless observations of the noise sequence. Given 
$V_1$, we have $n b^{-1} b_{\min} \le n^{\mathrm{reg}} \le n b^{-1} b_{\max}$. Let $R^{\mathrm{reg}}$ be the (random) decoding rate measured over regular 
channel positions only (including both accepted and discarded blocks), and denote by $p_a^{\mathrm{reg}}$ the (random) 
empirical distribution of the noise sequence over accepted regular positions (entire transmission period). 
Using (20) and substituting $n \to \beta n^{\mathrm{reg}}$ and $b \to 2b_{\max}$ we have that given $V_0 \cap V_1$ 



> p (log 1^1 - HK'I - A-, ^l^l^'-;^f"'°'' ) . H log m - HK')) - A,»=M^^ 

> R, - 2\XIK, . JL . !l12£I! (28) 

where $R_\beta \triangleq \beta\big(\log|\mathcal{X}| - H(p_a^{\mathrm{reg}})\big)$. 

^■^Note that even in the case of an individual noise sequence, the rate is still random due to training/update randomization. 
We now focus on the principal rate term $R_\beta$. As already mentioned, due to the concavity of the entropy 
it is expected that discarding blocks will only increase the achieved rate with high probability, as discarded 
blocks usually have a low empirical capacity. Therefore, we would like to seek conditions under which 
$R_\beta$ is minimized by $\beta = 1$ (no discarded blocks), and later show that these conditions are satisfied with 
high probability. Denote by $p^{\mathrm{reg}}$ and $p_d^{\mathrm{reg}}$ the (random) empirical distributions of the noise sequence over all 
regular positions, and over regular positions inside discarded blocks only, respectively. These distributions 
together with $p_a^{\mathrm{reg}}$ satisfy 

\[
p^{\mathrm{reg}} = \beta\, p_a^{\mathrm{reg}} + (1 - \beta)\, p_d^{\mathrm{reg}}
\]

Extracting $p_a^{\mathrm{reg}}$ and substituting into the expression for $R_\beta$ yields 

\[
R_\beta = \beta \Big( \log|\mathcal{X}| - H\big( p_d^{\mathrm{reg}} + \beta^{-1} (p^{\mathrm{reg}} - p_d^{\mathrm{reg}}) \big) \Big)
\]

Note that for any given values of $p^{\mathrm{reg}}$ and $p_d^{\mathrm{reg}}$, $R_\beta$ is defined only for values of $\beta$ large enough such that 
$p_d^{\mathrm{reg}} + \beta^{-1}(p^{\mathrm{reg}} - p_d^{\mathrm{reg}})$ is still a probability vector. Now, if the derivative of $R_\beta$ w.r.t. $\beta$ were to be non-positive 
over all the range of permissible $\beta$, then $R_\beta$ would be minimized by $\beta = 1$. We would therefore 
like to derive a condition for the non-positivity of the derivative. 

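Lemma 7. For any given $p^{\mathrm{reg}}, p_d^{\mathrm{reg}}$ and corresponding permissible $\beta$, 

\[
\frac{dR_\beta}{d\beta} \le \log|\mathcal{X}| - H(p_d^{\mathrm{reg}}) - D(p_d^{\mathrm{reg}} \| p^{\mathrm{reg}})
\]

Proof. See Appendix A. □ 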
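
The bound of Lemma 7 can be sanity-checked numerically by differentiating $R_\beta$ with a central difference and comparing against $\log|\mathcal{X}| - H(p_d^{\mathrm{reg}}) - D(p_d^{\mathrm{reg}} \| p^{\mathrm{reg}})$. The sketch below is only illustrative: the two distributions are arbitrary hypothetical choices, logarithms are taken to base 2, and the function names are not from the paper.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Relative entropy D(p||q) in bits (q is assumed strictly positive)."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)


def rate(beta: float, p_all: Sequence[float], p_disc: Sequence[float]) -> float:
    """R_beta = beta * (log|X| - H(p_disc + (p_all - p_disc) / beta))."""
    p_acc = [d + (a - d) / beta for a, d in zip(p_all, p_disc)]
    return beta * (math.log2(len(p_all)) - entropy(p_acc))


def rate_derivative(beta: float, p_all, p_disc, h: float = 1e-6) -> float:
    """Central-difference approximation of dR_beta / dbeta."""
    return (rate(beta + h, p_all, p_disc) - rate(beta - h, p_all, p_disc)) / (2 * h)


def lemma7_bound(p_all, p_disc) -> float:
    """Right-hand side of the Lemma 7 inequality."""
    return math.log2(len(p_all)) - entropy(p_disc) - kl_divergence(p_disc, p_all)


# Hypothetical distributions; every beta in [0.5, 1] is permissible for them.
p_all = [0.5, 0.3, 0.2]
p_disc = [0.4, 0.3, 0.3]
bound = lemma7_bound(p_all, p_disc)
for beta in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    assert rate_derivative(beta, p_all, p_disc) <= bound + 1e-6
```

At $\beta = 1$ the derivative meets the bound with equality, which is why the comparison allows a small numerical tolerance.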

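Based on the Lemma above, the following chain of inequalities provides an $\ell_\infty$-type upper bound on the 
derivative: 

\[
\begin{aligned}
\frac{dR_\beta}{d\beta} &\le \log|\mathcal{X}| - H(p_d^{\mathrm{reg}}) - D(p_d^{\mathrm{reg}} \| p^{\mathrm{reg}})
\overset{(a)}{\le} |\mathcal{X}| \log|\mathcal{X}| \, \| p_d^{\mathrm{reg}} - p_u \|_\infty - \frac{1}{2\ln 2} \| p_d^{\mathrm{reg}} - p^{\mathrm{reg}} \|_1^2 \\
&\overset{(b)}{\le} |\mathcal{X}| \log|\mathcal{X}| \, \| p_d^{\mathrm{reg}} - p_u \|_\infty - \frac{1}{2\ln 2} \| p_d^{\mathrm{reg}} - p^{\mathrm{reg}} \|_\infty^2 \\
&\overset{(c)}{\le} |\mathcal{X}| \log|\mathcal{X}| \, \| p_d^{\mathrm{reg}} - p_u \|_\infty - \frac{1}{2\ln 2} \Big| \| p^{\mathrm{reg}} - p_u \|_\infty - \| p_d^{\mathrm{reg}} - p_u \|_\infty \Big|^2
\end{aligned}
\]

where $\| \cdot \|_1$ is the $\ell_1$ norm. Transition (a) is due to Pinsker's inequality for the relative entropy [29] and 
the $\ell_\infty$ bound for the entropy (Lemma 1), transition (b) holds since the $\ell_1$ norm dominates the $\ell_\infty$ norm, 
and in (c) we used the triangle inequality. Thus, a sufficient condition for $\frac{dR_\beta}{d\beta} \le 0$ is given by 

\[
\Big| \| p^{\mathrm{reg}} - p_u \|_\infty - \| p_d^{\mathrm{reg}} - p_u \|_\infty \Big| \ge \sqrt{2 \ln 2 \cdot |\mathcal{X}| \log|\mathcal{X}| \cdot \| p_d^{\mathrm{reg}} - p_u \|_\infty} \tag{29}
\]

With some algebra, it is easily verified that a sufficient condition for (29) to hold is given by 

\[
\| p^{\mathrm{reg}} - p_u \|_\infty \ge 2|\mathcal{X}| \cdot \| p_d^{\mathrm{reg}} - p_u \|_\infty^{1/2}, \tag{30}
\]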

and so given (30) we have $R_\beta \ge R_{\beta=1} = \log|\mathcal{X}| - H(p^{\mathrm{reg}})$. Using (28) it is therefore established that (30) 
together with $V_0 \cap V_1$ imply that 

> log \X\ - Hip''^) - 2\X\K2 ■ ■ (31) 

f^min ri 

We would like to obtain a similar result involving $p_{\mathrm{emp}}(Z^n)$ and the rate $R_n$. To that end, let $\eta \in [0, 1]$ 
be the fraction of regular positions (out of $n$), and so $R_n = \eta R^{\mathrm{reg}}$. Let $p^{\mathrm{nreg}}$ denote the distribution of the 
noise sequence over non-regular positions. Given the event $V_1$ we have $\eta \ge b_{\min}/b$ and so 

||p„,p(Z") - = - pj + (1 - r;)(p"-8 -Pu)\\oo<V ^pJL + i^-v) (32) 

< l|p'"^-pJloo + ^^^ < ||p''=^-pJ|«,+4m6-i(l + 21og6) < ||p-s - pj|^ + A'amri log6 



< 1, ln2 < 1, \X\ > 2 and zlogx < ^ for x = 2,3, 
for some $K_3 > 0$, where we have used the convexity of the norm, and the fact that $\| p^{\mathrm{nreg}} - p_u \|_\infty \le 1$. 

Furthermore, using the concavity and nonnegativity of the entropy we have that given $V_1$ 

\[
H_{\mathrm{emp}}(Z^n) = H\big(p_{\mathrm{emp}}(Z^n)\big) = H\big(\eta\, p^{\mathrm{reg}} + (1 - \eta)\, p^{\mathrm{nreg}}\big) \ge \eta H(p^{\mathrm{reg}}) \ge \frac{b_{\min}}{b} H(p^{\mathrm{reg}}) \tag{33}
\]

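Now, introduce an auxiliary parameter $\gamma(n) = o(1)$ chosen to satisfy $\frac{\gamma}{2} \ge K_3 m b^{-1} \log b$, which is made 
feasible by requiring $m b^{-1} \log b = o(1)$. Define the events 

\[
V_2 \triangleq \Big\{ 2|\mathcal{X}| \, \| p_d^{\mathrm{reg}} - p_u \|_\infty^{1/2} \le \frac{\gamma}{2} \Big\}, \qquad
V_3 \triangleq \Big\{ \| p_{\mathrm{emp}}(Z^n) - p_u \|_\infty > \gamma \Big\} \tag{34}
\]

Now, using (32) it is readily verified that $V_1 \cap V_2 \cap V_3$ implies (30), and therefore $\bigcap_{i=0}^{3} V_i$ implies (31). 
Furthermore, $V_1$ implies (33). Putting (31) and (33) together, we establish that given $\bigcap_{i=0}^{3} V_i$ 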

i?„ = > ^ . ^rcg > ^ flog \X\ -^i7„„p(Z") - 2\X\K, ^ 



Oni i n Oni i n 



n 



> log lA-l - ifcmp(^") - log lA-l • ^4^^^ - 21^-1X2^^ (35) 

on 

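for some $K_4 > 0$. Moreover, given $V_3^c$ the $\ell_\infty$ bound for the entropy (Lemma 1) yields 

\[
\log|\mathcal{X}| - H_{\mathrm{emp}}(Z^n) \le \gamma\, |\mathcal{X}| \log|\mathcal{X}| \tag{36}
\]

which enables us to remove the dependence on the event $V_3$ by incorporating the above into the redundancy 
term. Namely, since $\big\{\bigcap_{i=0}^{3} V_i\big\} \cup V_3^c \supseteq \bigcap_{i=0}^{2} V_i$, we can combine (35) and (36) and the definition of the 
empirical capacity, to obtain 

\[
P\big( R_n(W, \theta_0) \ge C_{\mathrm{emp}}(W, \theta_0) - \varepsilon_2(n) \big) \ge P\Big( \bigcap_{i=0}^{2} V_i \Big) \tag{37}
\]

where 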

e.(n) ^ log|^|.7^4^ + 2|^|if,^+,|^|log|^| (38) 
n 



To conclude the rate analysis, we need to lower bound the probability of $\bigcap_{i=0}^{2} V_i$ and set $\gamma$ as a 
function of the scheme parameters. While analyzing the error probability, we have already established that 
$P(V_0^c) \le \varepsilon_1(n)$. We note that $P(V_1^c)$ is simply upper bounded by the probability of the event where at least one of $2nb^{-1}$ 
Binomial r.v.s $\sim B(b_p, \frac{m}{b_p})$ (the $M_t, M_u$ of all the blocks) deviates by more than $\frac{m}{2}$ from its expected value. 
Applying (say) the Hoeffding inequality [25] and then the union bound, we obtain 

\[
P(V_1^c) \le 4 n b^{-1} \exp\Big( -\frac{m^2}{2 b_p} \Big) \le 4 n b^{-1} \exp\Big( -\tfrac{1}{2} m^2 b^{-1} \Big) \triangleq \varepsilon_3^{(1)}(n) \tag{39}
\]

For $V_2$, we have $V_2^c \subseteq V_2'$ where $V_2'$ is the event where at least one discarded block has an empirical 
distribution $q^{\mathrm{reg}}$ over regular positions which does not satisfy the condition defining the event $V_2$, namely 

\[
2|\mathcal{X}| \, \| q^{\mathrm{reg}} - p_u \|_\infty^{1/2} > \frac{\gamma}{2}
\]

We would like to obtain a corresponding necessary condition on the deviation from uniformity of the 
training estimate, the probability of which we can then bound. Using norm properties, it is easily verified 
that 

2 • 4m 8m , ^, , _j 



!p„„p(Z''^)-pJ|oo-||q"^-p„ 



< = < ifgmfe" 

Omin b — 8m I log 1 — 4m 



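for some $K_5 > 0$, where $p_{\mathrm{emp}}(Z^{b_p})$ is the corresponding empirical distribution over all passive positions 
in the block. Hence, we have $V_2^c \subseteq V_2' \subseteq V_2''$ where $V_2''$ is the event where at least one block, with an 
empirical distribution over passive positions $p_{\mathrm{emp}}(Z^{b_p})$ and training estimate $p^{\mathrm{train}}$, simultaneously satisfies 

\[
\| p_{\mathrm{emp}}(Z^{b_p}) - p_u \|_\infty > \Big( \frac{\gamma}{4|\mathcal{X}|} \Big)^2 - K_5 m b^{-1}, \qquad \| p^{\mathrm{train}} - p_u \|_\infty < \tau_d = 5\tau
\]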






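Hence, using the triangle inequality a necessary condition for $V_2''$ is for the training estimate in some block 
to deviate by at least 

\[
\| p^{\mathrm{train}} - p_{\mathrm{emp}}(Z^{b_p}) \|_\infty > \Big( \frac{\gamma}{4|\mathcal{X}|} \Big)^2 - K_5 m b^{-1} - 5\tau \tag{40}
\]

We would now like to set $\gamma$ so that the right-hand-side of the above is strictly positive, and such that 
$\frac{\gamma}{2} \ge K_3 m b^{-1} \log b$, which was previously required, is also satisfied. This is obtained by setting $\gamma$ to satisfy 

\[
\Big( \frac{\gamma}{4|\mathcal{X}|} \Big)^2 - K_5 m b^{-1} - 5\tau = K_6 \big( \tau + m b^{-1} \log b \big) \tag{41}
\]

with $K_6 > 0$ large enough. Using Lemma 5 and the union bound over blocks, we get 

\[
P(V_2^c) \le n b^{-1} P\Big( \| p^{\mathrm{train}} - p_{\mathrm{emp}}(Z^{b_p}) \|_\infty > K_6 \big( \tau + m b^{-1} \log b \big) \Big)
\le 2|\mathcal{X}|\, n b^{-1} \exp\Big( -\tfrac{1}{2} K_6^2 \big( m b^{-1} \log b + \tau \big)^2 m^2 b^{-1} \Big) \triangleq \varepsilon_3^{(2)}(n)
\]

and finally, 

\[
P\Big( \bigcap_{i=0}^{2} V_i \Big) \ge 1 - \sum_{i=0}^{2} P(V_i^c) \ge 1 - \big( \varepsilon_1(n) + \varepsilon_3^{(1)}(n) + \varepsilon_3^{(2)}(n) \big) \triangleq 1 - \varepsilon_3(n) \tag{42}
\]

Note that $\varepsilon_3(n) = O(\varepsilon_1(n))$ and hence $\varepsilon_3(n) \to 0$ under the same condition provided for the error 
probability in subsection V.1. 

Summarizing, we have found that a rate $\varepsilon_2(n)$ close to the empirical capacity (eq. (38) with $\gamma$ given 
in (41)) is achieved with probability at least $1 - \varepsilon_3(n)$ (eq. (42)), and an error probability no larger than 
$\varepsilon_1(n)$ (eq. (27)). In passing, we have also described a set of constraints on the asymptotic behavior of the 
scheme parameters that are sufficient to guarantee that $\varepsilon_1(n), \varepsilon_2(n), \varepsilon_3(n) \to 0$. In the next subsection we 
summarize these constraints, and show that there exist (many) selections of scheme parameters for which 
they are satisfied. 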



V.3 Parameter Selection and Asymptotic Behavior 

There are many different selections of the scheme parameters $b(n), m(n), \tau(n)$ which allow all the 
convergence parameters $\varepsilon_1(n), \varepsilon_2(n), \varepsilon_3(n)$ to become asymptotically negligible, and result in various trade-offs 
between them. The following is the set of sufficient asymptotic conditions to that end, derived directly 
from the discussion in the two previous subsections: 

(I) $\tau = o(1)$                          (IV) $b n^{-1} \log n = o(1)$ 
(II) $m b^{-1} \log b = o(1)$              (V) $\tau^2 m^2 b^{-1} = \omega\big(\log^2 n \log(b^{-1} n \log n)\big)$          (43) 

Note that the above conditions also imply that $b = \omega(\log n)$, which was an assumption made when 
computing the number of update bits. Under (43), the following asymptotical behavior for the convergence 
parameters is achievable: 

Error probability $\varepsilon_1(n)$:                       $-\log \varepsilon_1(n) = \Omega\Big( \frac{\tau^2 m^2 b^{-1}}{\log^2 n} \Big)$ 
Target redundancy $\varepsilon_2(n)$:                       $\varepsilon_2(n) = O(b n^{-1} \log n) + O\Big( \sqrt{\tau + m b^{-1} \log b} \Big)$ 
Redundancy exceeding probability $\varepsilon_3(n)$:        $-\log \varepsilon_3(n) = \Omega\Big( \frac{\tau^2 m^2 b^{-1}}{\log^2 n} \Big)$ 




To demonstrate that the sufficient conditions (43) can be met, let us specifically set the parameters to 

\[
b(n) = n^{a_0}, \qquad m(n) = n^{a_1}, \qquad \tau(n) = n^{-a_2} \tag{44}
\]

for some positive constants $a_0, a_1, a_2$. The conditions then translate into 

\[
a_1 < a_0 < 1, \qquad a_0 < 2(a_1 - a_2)
\]

It is easy to find many parameter selections satisfying the above conditions, and one possible selection is 
given by (00,01,02) = (|, 5, j^). 
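
A candidate triple of exponents can be checked mechanically against the two inequalities above. The sketch below is a minimal illustration; the function name and the specific triples are hypothetical examples chosen here, not the selection given in the text, and polylogarithmic factors are ignored since they do not affect the polynomial exponents.

```python
def exponents_are_feasible(a0: float, a1: float, a2: float) -> bool:
    """Check a_1 < a_0 < 1 and a_0 < 2(a_1 - a_2) for positive exponents,
    where b(n) = n**a0, m(n) = n**a1 and tau(n) = n**(-a2)."""
    positive = a0 > 0 and a1 > 0 and a2 > 0
    return positive and a1 < a0 < 1 and a0 < 2 * (a1 - a2)


# Hypothetical selections, for illustration only.
print(exponents_are_feasible(0.5, 0.45, 0.1))   # block length n^0.5, n^0.45 training positions
print(exponents_are_feasible(0.5, 0.3, 0.1))    # too few training positions for this tau
```

The second inequality shows the tension between the parameters: a faster-decaying threshold (larger $a_2$) requires more training positions per block (larger $a_1$), which in turn must stay below the block-length exponent $a_0$.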

Recall that the thresholds $\tau_d(n), \tau_u(n)$ are determined by $\tau(n)$, as given in (26). 
V.4 Noise Sequence Channels 

As already mentioned, the family of noise sequence channels is a subfamily of the family of modulo-additive 
channels and therefore the analysis presented thus far specifically holds when communications 
take place over an unknown noise sequence channel. Specifically, this is also true for the special case of an individual 
noise sequence, which was given as a motivating example in section I. However, it turns out that when 
transmission takes place over the family of noise sequence channels it is possible to obtain better convergence tradeoffs than when 
operating over the general modulo-additive family. This is achieved by the following simple modification within each block: The sequence 
$A^{b}$ is drawn uniformly over the type of all sequences with exactly $m$ training positions and exactly $m$ 
update positions. The sequence $\Gamma^{m}$ (note that now $M_u = m$) is drawn uniformly over the type of all 
sequences with a uniform composition over the alphabet $\langle (|\mathcal{X}| - 1)s \rangle$. 

These changes amount to using a fixed number of training/update positions, and using a fixed repetition 
code for each update bit with each input pair, which means that position types are not selected in an i.i.d. 
fashion anymore. Thus, a fully informed adversary can now predict the type of the next position with some 
accuracy, and possibly exhibit "atypical behavior" accordingly (say over training positions), rendering the 
scheme useless. For noise sequence channels however, the noise sequence is "generated separately" from 
the input/output sequence (in the sense described by the Markov relation), hence the adversary cannot 
change its strategy based on its ability to predict, which is why the scheme can still work. The most basic 
example for that is the case where the noise is an individual sequence, which is fixed at the beginning of 
time and cannot adapt according to the observed inputs/outputs. 

The only derivation in the achievability proof that needs to be modified is that of the deviation 
probability of a sample's empirical distribution from the true empirical distribution, where this sample is now 
uniform over a type. To this end, we can use Lemma 4 (sampling without replacement) in lieu of Lemma 
5, as the noise sequence is now statistically independent of the (training, update) sampling sequences. 
Interestingly, the exponential decay of the deviation probability is linear in the number of samples (either 
$m$ or $\frac{m}{2(|\mathcal{X}|-1)^2 \lceil \log(n+1) \rceil}$ in our case) and does not involve the length of the sequence sampled from ($b_p$ in our 
case). Therefore, the expressions for $\varepsilon_i(n)$ for noise sequence channels are essentially given by exchanging 
the expressions $b^{-1} m^2 \to m$ and $\lceil \log(n+1) \rceil^2 \to \lceil \log(n+1) \rceil$ in all the exponents (up to constant factors). 
A set of sufficient conditions for communications over the family of noise sequence channels is given by 

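(I) $\tau = o(1)$                          (V) $\tau^2 m = \omega\big(\log n \log(b^{-1} n \log n)\big)$ 
(II) $m b^{-1} \log b = o(1)$              (VI) $b = \omega(\log n)$                                        (45) 
(III) $b n^{-1} \log n = o(1)$ 

and the corresponding asymptotical behavior of the convergence parameters is given by 

Error probability $\varepsilon_1(n)$:                       $-\log \varepsilon_1(n) = \Omega\Big( \frac{\tau^2 m}{\log n} \Big)$ 
Target redundancy $\varepsilon_2(n)$:                       $\varepsilon_2(n) = O(b n^{-1} \log n) + O\Big( \sqrt{\tau + m b^{-1} \log b} \Big)$ 
Redundancy exceeding probability $\varepsilon_3(n)$:        $-\log \varepsilon_3(n) = \Omega\Big( \frac{\tau^2 m}{\log n} \Big)$ 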

Finally, note that moving from i.i.d. sampling to fixed-size sampling without replacement is fundamental 
for reaping this performance gain in noise sequence channels, since even when the noise is independent of 
the sampling sequence, the tail of the binomial distribution renders i.i.d. sampling inferior. 

V.5 Randomness Resources 

Randomness is a key element in achieving the empirical capacity. Let us examine just how many common 
random bits are consumed by our scheme. The receiver generates $O(n b^{-1} m \log b)$ random bits over the 
entire transmission period. Under the parameter constraints (43) or (45) this amount is sub-linear in $n$, 
as otherwise it could not be accommodated by feedback. It is easily verified that these conditions imply 
that for any $\epsilon > 0$, the following amount of randomness is sufficient for achieving the empirical capacity, 
for the different channel families: 

^^We assume m divides the size of the alphabet, however using a close to uniform composition works as well. 
^^The parameters b, m are provided, it is readily verified that r can be set to satisfy the required conditions. 
Family of Channels 


Random Bits 


Generation Mechanism 


Parameters 


Noise Sequence .yVx 


0(log^+-' n) 


Feedback 




Modulo- Additive .-C^ 




Feedback 




General Causal '^^ 


0(n) 


Common Randomness 


Any feasible selection 



Interestingly, the randomness resources consumed are significantly reduced when operating over 
noise sequence channels rather than over general modulo-additive channels. Thus, in a sense it seems that when working against an adversary most 
of the randomness resources are dedicated to "decoupling" its actions from the channel inputs/outputs, 
and only a negligible amount of randomness is used for combating the "noise effect" itself. In the most 
general case of communication over general causal channels, a much larger amount of random bits (mainly used for dithering) 
cannot be accommodated by the feedback link, and an external common randomness source is required. 

VI Summary and Discussion 

The universal communication problem over an unknown discrete channel with noiseless feedback was ad- 
dressed. An extreme channel uncertainty model was considered, where the channel law is unknown to both 
transmitter and receiver, and may vary arbitrarily from symbol to symbol depending on previous inputs 
and outputs, possibly in an adversarial fashion. Although in such a general setting no positive rate can be 
guaranteed in advance, it was constructively shown that reliable communications at a variable rate that 
corresponds to the empirical goodness of the channel, can be attained. As a measure for this empirical 
goodness, the empirical capacity of the channel was defined as the capacity of an equivalent memoryless 
modulo-additive channel, with an additive noise marginal distribution given by the empirical distribution 
of a noise sequence realized by channel actions throughout transmission. An explicit sequential transmis- 
sion scheme was then described, and shown to achieve rates arbitrarily close to the empirical capacity with 
probability approaching one, independent of the actual channel in use and uniformly over the message 
set. For the special case of individual noise sequence channels, the scheme is universal in the sense of 
successfully competing with any fixed-rate transmission scheme that knows the empirical distribution of 
the noise sequence in advance. 

Achieving the empirical capacity requires randomization. This is especially evident in the case of 
an individual noise sequence channel, where it is well known that deterministic coding schemes cannot 
attain the empirical capacity uniformly over the message set in general, even if the empirical distribution 
of the noise sequence is given in advance. Consequently, the described scheme requires the generation of 
common random bits shared by the transmitter and the receiver. In the most general setting, $O(n)$ random 
bits are used by the scheme, a quantity requiring an external source of common randomness available to 
the terminals. However, if the channel law is known to be modulo-additive at any time instant (but 
otherwise arbitrary varying, depending on previous inputs/outputs), only ©(y^log^^"^ n) random bits are 
sufficient for any e > 0, an amount that can be generated exclusively via feedback at no asymptotical cost. 
Furthermore, in the special case of noise sequence channels (where the channel is completely defined by the 
noise sequence) the scheme exhibits improved performance in terms of error probability and redundancy, 
and the amount of common randomness is further reduced to merely 0(log^'*'^ n) random bits, which again 
can be produced by feedback alone. 

The tradeoff between error probability and transmission time attained by the scheme is sub-exponential 
in n. This is to be expected, since the actual channel over which communications take place might be (say) 
a BSC, in which case the empirical capacity converges to the channel capacity a.s., so one cannot hope 
to universally obtain a positive error exponent when operating at the empirical capacity. However, if one 
is willing to give up a constant portion of the empirical capacity then it is plausible that a positive error 
exponent could be universally attained, yet we were unable to adapt our scheme to that end. In order 
to make the errors due to training estimate deviations vanish exponentially with n, a linear number of 
training positions must be set, which in the finite-horizon setting implies a constant number of blocks. 
This however results in an excessively slow update rate for the KT estimate which prohibits any positive 
rate from being attained, and it therefore seems that an altogether different approach is required. 

In this paper the discussion was limited to a memoryless modulo-additive model, where universality 
is sought w.r.t. the marginal empirical distribution of the realized noise sequence. In part two of this 
work, the concepts presented here are further developed to encompass more general models for channel 
actions. Specifically, we discuss models that take into account empirical dependencies between the channel 
actions and the input, and exploit empirical memory within consecutive channel actions. This approach will 
facilitate universality w.r.t. higher order empirical statistics, achieving the empirical capacity corresponding 
to more elaborate models. 

A Proofs of Lemmas 

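Proof of Lemma 1. Let $p = (p_0, p_1, \ldots, p_{|\mathcal{X}|-1})$, and assume without loss of generality that $\min(p) = p_0$. 
For any $1 \le i \le |\mathcal{X}| - 1$, define 

\[
p^{(i)} \triangleq (\underbrace{0, \ldots, 0}_{i}, 1, \underbrace{0, \ldots, 0}_{|\mathcal{X}|-i-1})
\]

and let $p_u$ be the uniform distribution over $\mathcal{X}$. Express $p$ as the following convex combination: 

\[
p = p_0 \cdot |\mathcal{X}| \cdot p_u + \sum_{i=1}^{|\mathcal{X}|-1} (p_i - p_0) \cdot p^{(i)}
\]

Using Jensen's inequality we get 

\[
H(p) \ge p_0 \cdot |\mathcal{X}| \cdot H(p_u) + \sum_{i=1}^{|\mathcal{X}|-1} (p_i - p_0) \cdot H(p^{(i)}) = p_0 \cdot |\mathcal{X}| \log|\mathcal{X}| \tag{46}
\]

Now, by definition $\| p - p_u \|_\infty \ge \frac{1}{|\mathcal{X}|} - p_0$, and hence $p_0 \ge \frac{1}{|\mathcal{X}|} - \| p - p_u \|_\infty$. Substituting this into 
(46), we obtain 

\[
H(p) \ge \log|\mathcal{X}| \big( 1 - |\mathcal{X}| \, \| p - p_u \|_\infty \big)
\]

as desired. Note that (46) is in fact a uniformly better bound, but the $\ell_\infty$ bound is sufficient for our needs 
and easier to work with. 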

□ 
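
Both lower bounds from this proof, the one in (46) based on the smallest probability mass and the looser $\ell_\infty$ bound, are easy to evaluate numerically. The following sketch checks them on a few hand-picked distributions; the function names are hypothetical and logarithms are taken to base 2.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def min_mass_bound(p: Sequence[float]) -> float:
    """Bound (46): H(p) >= min(p) * |X| * log|X|."""
    n = len(p)
    return min(p) * n * math.log2(n)


def linf_bound(p: Sequence[float]) -> float:
    """l-infinity bound: H(p) >= log|X| * (1 - |X| * ||p - p_u||_inf)."""
    n = len(p)
    gap = max(abs(x - 1.0 / n) for x in p)
    return math.log2(n) * (1.0 - n * gap)


examples = [
    [0.4, 0.1, 0.4, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.7, 0.2, 0.1],
    [0.9, 0.1],
]
for p in examples:
    # (46) is never weaker than the l-infinity bound, and both lie below H(p).
    assert entropy(p) + 1e-12 >= min_mass_bound(p)
    assert min_mass_bound(p) + 1e-12 >= linf_bound(p)
```

For distributions far from uniform the $\ell_\infty$ bound becomes negative and therefore trivial, which is harmless here since it is only applied when the empirical distribution is within a vanishing distance of uniform.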

Proof of Lemma [3^ Let £1, be the position within the sequence z" of the fcth appearance of the ith symbol, 
and denote d = d{z^,w'"'). We bound the excess redundancy incurred by using the KT(&) estimator with 
noisy observations, assuming that b + d < n + 1, as follows: 

h + d-l 



< 1^-1(6 + d-l)log(2(6 + d-l)) + 1^-1 ^ log 1 



n — 1 

< \X\{b + d- 1) log(2(6 + d-l)) + \X\{b + d- 1) log e y ^ 

< lA-K^ + d- l)log(2(6 + d- 1)) + lA-K^ + rf- l)(21oge + log(2n - 1)) 

< 2|A'|(5 + d- l)log2ne 

This completes the proof for b + d < n + 1. For b + d > n + 1, we note that the maximal possible 
excess redundancy per symbol at each step is upper bounded by log(2n + \X\ — 2), hence the total excess 
redundancy is upper bounded by 

\[
n \log(2n + |\mathcal{X}| - 2) \le (b + d - 1) \log(2n + |\mathcal{X}| - 2) \le 2|\mathcal{X}| (b + d - 1) \log 2ne
\]

where the last transition is a simple exercise using the fact that $|\mathcal{X}| \ge 2$. 



□ 



Proof of Lemma 4. We first prove the Lemma for $\mathcal{X} = \{0, 1\}$ and a deterministic $Z^n = z^n$; the general 
case then results as a simple corollary. Under these assumptions, we have 

\[
P\Big( \big\| p_{\mathrm{emp}}(z^n) - p_{\mathrm{emp}}(z^n | B^n) \big\|_\infty > \tau \Big)
= P\Big( \Big| \frac{1}{n} \sum_{k=1}^{n} z_k - \frac{1}{m} \sum_{k=1}^{n} z_k B_k \Big| > \tau \Big) \tag{47}
\]

Hoeffding showed that the distribution of the sample mean for sampling without replacement from a finite 
(deterministic) population obeys the same bounds for deviation from the mean as the ones he obtained 
for an i.i.d. sample with mean equal to the empirical mean of the population [25, sec. 6]. Following this, 
let $A^m$ be an i.i.d. $\sim \mathrm{Ber}\big( \frac{1}{n} \sum_{k=1}^{n} z_k \big)$ sequence. Using the standard Hoeffding inequality [25, sec. 2], we 
obtain 

\[
P\Big( \Big| \frac{1}{n} \sum_{k=1}^{n} z_k - \frac{1}{m} \sum_{k=1}^{m} A_k \Big| > \tau \Big) \le 2 \exp\big( -2 m \tau^2 \big)
\]

and by Hoeffding's claim above, this bound also holds for (47). This concludes the proof for the binary 
alphabet with a deterministic $Z^n$. For a stochastic $Z^n$ the result stands since $B^n$ and $Z^n$ are independent, 
and the bound holds for any realization $Z^n = z^n$. For a larger alphabet, one can define an indicator sequence 
for any symbol $i \in \mathcal{X}$, namely a sequence whose kth element is given by $\mathbb{1}_i(Z_k)$. The proof 
then follows from the binary alphabet analysis, and the union bound over all the symbols. □ 
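
The binary case of this bound lends itself to a quick Monte Carlo illustration: draw $m$ positions without replacement from a fixed binary sequence and compare the observed frequency of large deviations with $2\exp(-2m\tau^2)$. The sketch below is only a sanity check under arbitrary, hypothetical parameters (sequence, sample size, threshold, number of trials, and seed are all chosen here for illustration).

```python
import math
import random
from typing import Sequence


def deviation_frequency(z: Sequence[int], m: int, tau: float,
                        trials: int, seed: int = 0) -> float:
    """Fraction of trials in which the mean of m symbols sampled without
    replacement from z deviates from the mean of z by more than tau."""
    rng = random.Random(seed)
    population_mean = sum(z) / len(z)
    exceed = 0
    for _ in range(trials):
        sample = rng.sample(list(z), m)  # sampling without replacement
        if abs(sum(sample) / m - population_mean) > tau:
            exceed += 1
    return exceed / trials


def hoeffding_bound(m: int, tau: float) -> float:
    """Right-hand side of the binary-alphabet bound, 2 * exp(-2 * m * tau^2)."""
    return 2.0 * math.exp(-2.0 * m * tau * tau)


# Hypothetical individual noise sequence with 30% ones.
z = [1] * 60 + [0] * 140
freq = deviation_frequency(z, m=50, tau=0.15, trials=2000)
assert freq <= hoeffding_bound(50, 0.15)
```

Note that the bound depends only on the number of samples $m$ and not on the length of the sequence being sampled, in line with the remark in subsection V.4.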

Proof of Lemma\Bi For any i e A", define the r.v. sequence A^^-^ of length n, whose /cth element is given by 



Ak,i,,)=Y.^^{Z,)-q-H,{Z,)-B, 



We have that 



^Ak+iM 14) = a') = + nUZk+i) ■ (1 



ak 
ak 



-'E{Bk+i\A 



(0 



)I4)=«') 

C(l,(Zfc+i)|4)=«') 



where in the last two transitions we used the causal independence assumption and the fact that $B_k \sim \mathrm{Ber}(q)$. 
Similarly, we also have that $E(A_{1,(i)}) = 0$. Therefore for any $i \in \mathcal{X}$, $A_{k,(i)}$ is a zero mean martingale with 
differences that are bounded by $|A_{k+1,(i)} - A_{k,(i)}| \le \max\{1, q^{-1} - 1\} \le q^{-1}$. By the Azuma-Hoeffding 
inequality for bounded-difference martingales [26], for any $\tau > 0$ 

P( |Ai,(i)l > ^ 2exp 
The result is now established as follows: 



9 ? 



( ||Pen.p(^") - «(i?")Pemp (^"iS")|L > ^ = ^ l U 



1 



k=l 



nq 



k=l 



MZk)-Bk 



> T 



< 



lAuwl > nr) < 2\X\cxp - 



9 9 

nr^q^ 



□ 



^In the binary case we get a coefficient of 2 instead of 4 multiplying the exponent. 
Proof of Lemma 6. To avoid heavy indexing we prove the result for the first block and the training pattern 
sequence; this is easily extended to any other block and to the repetition pattern sequences. For $1 \le k \le b$, 
let $A_k$ be a r.v. denoting the exact usage of each position, thus taking values over a generic alphabet 
$\mathcal{A} = \{\text{active}, \text{regular}, \text{training}\} \cup \langle (|\mathcal{X}| - 1)s \rangle$ where the numerical values correspond to update positions 
for a specific update bit using a specific input pair, as described in subsection IV.5. For brevity, we omit 
subscripts and write $P(x|y)$ for $P_{X|Y}(x|y)$. Summations are understood to be taken over all feasible values 
of the summation variables. Let us now prove that $A_{k+1}$ and $(Z^{k+1}, A^k)$ are statistically independent 
for $b_a \le k < b$, from which the result then follows immediately, since the training pattern is simply an 
indicator sequence $B_k = \mathbb{1}_{\text{training}}(A_{k+b_a})$. We have, 



(a'=)P(z'=+i|a'=) ' P(z'=+i|afc) 

where in the second transition we used the fact that by construction, $A^{b_a}$ is a constant deterministic
sequence and $A_{b_a+1}^{b}$ is an i.i.d. sequence. It is therefore sufficient to show that $Z^{k+1} \leftrightarrow A^k \leftrightarrow A_{k+1}$. To
that end:

P(z'=+i|a'=+i) = P(z'=+V^y^a'=+l)P(a;^2;'=|a'=+l) 

= ^ P(z'=|x^2;^a^+l)P(zfc+l|z^x^y^a'=+l)P(a;^/|a^+l) 
= ^ P(z^|x^y'^)P(zfe+l|x^y^a^+l)P(x^/|a'=+l) 

x^y^ 

^ ^ P(z'=|a;^/)P(zfc+l|a;^2/^a'=)P(a;'=|y^a'=+l)P(/|a'=+l) 



i P(^'=|x^/)P(zfc+l|z^2/^a^)P(x^|y^-^a^■)P(/|a^■+l) 

k 

k 

i Y p(^1x^2/'=)P(zfc+l|x^y^a^)P(x'=|/-^a'=)^p(%l2/^"^a^■) 

x^y^ j — 1 

® P(z'=+i|a'=) (48) 
where the transitions are justified as follows ($b_a < k < b$):

(a) Zk ^ x'^-^Y''-^ ^ A''. 

Proof: We easily find that Zk ^ x'^~^Y'^~^ ^ U^~^ by combining Zk ^ x'^'^Y^'^ ^ Xk given 
in ([6]); with Zk j^fcyfc-i ^ [/'^'^^ which is equivalent to ([9]). The relation now follows since by 
construction A^ = ixmc{U^'^). 

(b) = imic{0(3,Y^~^ , A^), by construction. 

(c) Yk ^ Y^-^A^ ^ A\^^. 
Proof: 

P(yfc|/-\ a'') = ^ P(yfe|/-\ a^)P(x^|/-\ a'') = ^ P(yfe|/-\ x^ a'^-)P(cc'^-|y^^ 
= P(y,|y'=-i,a'^-) 

where in the second transition we used (|b| above together with Yk ^ ^fcyfe-i ^ j^b ^ which stems 
from ^ using the fact that by construction A^ = func([/''"). 
(d) The dependence of the expression on $a_{k+1}$ has been removed.

□ 

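Proof of Lemma^ Let $p, q$ be two distributions over $\mathcal{X}$. Then for any $\alpha \in \mathbb{R}$ such that $p + \alpha(q - p)$ is
also a probability distribution,

$$\frac{\partial}{\partial \alpha} H(p + \alpha(q - p)) = -\sum_{i \in \mathcal{X}} (q_i - p_i) \log(p_i + \alpha(q_i - p_i)) - \sum_{i \in \mathcal{X}} (q_i - p_i) \frac{p_i + \alpha(q_i - p_i)}{p_i + \alpha(q_i - p_i)} \log e$$
$$= -\sum_{i \in \mathcal{X}} (q_i - p_i) \log(p_i + \alpha(q_i - p_i))$$
$$= -\frac{1}{\alpha} \left( \sum_{i \in \mathcal{X}} (p_i + \alpha(q_i - p_i)) \log(p_i + \alpha(q_i - p_i)) - \sum_{i \in \mathcal{X}} p_i \log(p_i + \alpha(q_i - p_i)) \right)$$
$$= \frac{1}{\alpha} \left[ H(p + \alpha(q - p)) - H(p) - D(p \,\|\, p + \alpha(q - p)) \right]$$

Using the above, we have that for any $\beta \in [0, 1]$ such that $p + \beta^{-1}(q - p)$ is a probability distribution,

$$\frac{\partial}{\partial \beta} \left\{ \beta \left( \log|\mathcal{X}| - H(p + \beta^{-1}(q - p)) \right) \right\}$$
$$= \log|\mathcal{X}| - H(p + \beta^{-1}(q - p)) + \left( H(p + \beta^{-1}(q - p)) - H(p) - D(p \,\|\, p + \beta^{-1}(q - p)) \right)$$
$$= \log|\mathcal{X}| - H(p) - D(p \,\|\, p + \beta^{-1}(q - p))$$
$$\le \log|\mathcal{X}| - H(p) - \beta D(p \,\|\, p + \beta^{-1}(q - p)) - (1 - \beta) D(p \,\|\, p)$$
$$\overset{(a)}{\le} \log|\mathcal{X}| - H(p) - D\left(p \,\|\, \beta(p + \beta^{-1}(q - p)) + (1 - \beta)p\right) = \log|\mathcal{X}| - H(p) - D(p \,\|\, q)$$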

where in (a) we have used the convexity of the relative entropy. The proof is concluded by substituting 
p, q with p'^^^jP'^'^^ respectively. □ 



B A Horizon Free Universal Scheme 

In this subsection we show how the presented finite-horizon feedback transmission scheme can be trans- 
formed into a horizon-free scheme, with an instantaneous rate approaching the empirical capacity. To 
motivate this generalization, suppose one wishes to transmit a fixed number of bits using the finite-horizon 
scheme. In this case, it may be that capacity-wise, the receiver could have potentially decoded enough 
bits half way throughout transmission, and even worse - could not do so when transmission ends due to a 
deterioration in channel conditions. In this case it is therefore critical that the transmission can be stopped 
at any given time, while achieving the instantaneous empirical capacity. 

The idea is that instead of taking a fixed transmission period $n$ and dividing it into blocks of a fixed size
$b(n)$, a variable block size is used, growing with time. The apparent difficulty with this approach is that,
in contrast to the finite-horizon case and although the size of the last block is increasing, the size of any
specific block is constant, and a non-negligible update decoding error probability in each specific block is
incurred. This in turn results in two problems. First, bounding the error probability as before using a union
bound over update decoding error events provides a non-vanishing bound dominated by the first block.
Second, the resulting KT estimates use noisy observations, which incurs a redundancy penalty. The first
problem is essentially solved by making sure that in the event where the last accepted block is not "recent
enough", no bits are decoded. Loosely speaking, this event implies the empirical capacity is small anyway
with high probability, hence the resulting excess redundancy is negligible. As for the second problem, for
a suitable selection of scheme parameters we can show that with high probability, the Hamming distance
between the noise sequence and the corresponding noisy observations sequence increases slowly enough, so
that the excess redundancy becomes negligible.

Following this discussion, the horizon-free scheme is obtained via the following modifications of the 
finite-horizon scheme: 
(A) The size of the $k$th block is set to $b_k$, where $\{b_k \in \mathbb{N}\}_{k=1}^{\infty}$ is strictly increasing. In our proof, we use
an arithmetically growing block size, i.e., $b_k = b_0 + k$ for some $b_0 \in \mathbb{N}$.

(B) The parameters of the $k$th block are fixed functions of its size $b_k$, i.e., $m_k = m(b_k)$, $\tau_k = \tau(b_k)$, etc.

(C) On the $k$th block, the update information consists of the type of the noise sequence (symbol
occurrences vector) over regular positions in the previously accepted block, and the index of the
corresponding message interval. Let $n_k = \sum_{j=1}^{k} b_j$ be the number of channel uses in the first $k$ blocks. The
number of uncoded update bits in the $k$th block is therefore given by replacing $b \to b_{k-1}$, $n \to n_{k-1}$
in the left-hand side of (HJ).

(D) Transmission can be terminated at any point. When terminated, the receiver normally decodes the
binary interval pertaining to the last known message interval (from the last accepted block) using
the corresponding ambiguity resolving bit. However, if transmission ended during the $k$th block and
the last accepted block $k_{\mathrm{acc}}$ is not recent enough, namely $k_{\mathrm{acc}} < \rho_k$ for some predetermined recency
threshold $\rho_k$, then the decoded interval is $[0, 1)$, i.e., no bits are decoded.
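The block bookkeeping in modifications (A) and (D) can be sketched as follows. This is only an illustration under stated assumptions: block sizes grow arithmetically as $b_k = b_0 + k$, the recency threshold is taken as $\rho_k = k^{\alpha_3}$ (the choice used later in the proof), and, as a simplification, $k$ counts fully completed blocks. All function names are hypothetical.

```python
import math

def channel_uses(k, b0=0):
    """n_k = sum_{j=1..k} (b0 + j): channel uses in the first k blocks."""
    return k * b0 + k * (k + 1) // 2

def completed_blocks(n, b0=0):
    """Largest k with n_k <= n, i.e. the number of fully transmitted blocks."""
    k = 0
    while channel_uses(k + 1, b0) <= n:
        k += 1
    return k

def decoded_block(n, last_accepted, alpha3, b0=0):
    """Rule (D), simplified: return the index of the block whose message
    interval is decoded when transmission stops at time n, or None when the
    last accepted block is older than the assumed threshold rho_k = k**alpha3."""
    k = completed_blocks(n, b0)
    rho_k = k ** alpha3
    if last_accepted is None or last_accepted < rho_k:
        return None  # decoded interval is [0, 1): no bits are decoded
    return last_accepted
```

With $b_0 = 0$ and $n = n_k = k(k+1)/2$, the block count satisfies $\sqrt{2n} - 1 < k < \sqrt{2n}$, which is the relation used in the error analysis.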

We now turn to prove that this modified scheme achieves the empirical capacity for a suitable selection of 
parameters bk, rrik, Tk, Pk, where the thresholds r^^k and Tu,k are determined by Tk as before. For simplicity 
of the exposition, assume the length of the $k$th block is $b_k = k$, and choose $m_k$ and $\tau_k$ to be

$$m_k = b_k^{\alpha_1} = k^{\alpha_1}, \qquad \tau_k = b_k^{-\alpha_2} = k^{-\alpha_2}$$

for some constants $\alpha_1, \alpha_2 \in (0,1)$. For brevity, we mostly disregard non-integer issues throughout this
section, as these have no asymptotic effect. Let $n_k = \sum_{j=1}^{k} b_j = \frac{k(k+1)}{2}$. The number of uncoded update
bits in the $k$th block is zero padded up to $s_k$, which is given by

logCr'j + [log (1 + m - l)nfe_i) j < (lA-l - 1) logfc + log((|A'| - l)fc(fc + 1)) < 21^-1 log(fc + 1) ^ 

(49) 

Following the same derivations as in Section |Vl we have that 

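$$\mathbb{P}\left(E_1^{(k)}\right) \le 2|\mathcal{X}| \exp\left(-\tfrac{1}{2}\, \tau_k^2 m_k^2 b_k^{-1}\right) = 2|\mathcal{X}| \exp\left(-\tfrac{1}{2}\, k^{2(\alpha_1 - \alpha_2) - 1}\right)$$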

P(i5f' ) < 41^-13 riog(fc + 1)1 exp ( -—-±-JU^—^ = 4\X\^ [log(fc + 1)1 exp -^^^^^ 



iA'|4riog(fc + i)r/ V 8|A'|4riog(fc + i)r 



where $E_1^{(k)}, E_2^{(k)}$ are the events in the $k$th block corresponding to $E_1, E_2$ defined in subsection V.1. Thus,
$E_1^{(k)} \cup E_2^{(k)}$ constitutes a necessary condition for an update decoding error in the $k$th block. At this stage
in the finite-horizon proof, we have used the union bound over update decoding error events in each block
to obtain an upper bound for the error probability in the finite-horizon scheme. However, in this case
taking the union bound would result in an error probability that is dominated by the first block, hence
not decaying to zero. From this point, assume the transmission scheme was terminated at time $n = n_k$
after precisely $k$ arithmetically growing blocks were sent, which means that $\sqrt{2n} - 1 < k < \sqrt{2n}$. Let
us divide the transmission period into two batches: The first batch includes the first $k^{\alpha_3}$ blocks for some
$\alpha_3 \in (0, 1)$, while the last batch includes all the remaining $k - k^{\alpha_3}$ blocks. Let us also set the recency threshold to
be $\rho_k = k^{\alpha_3}$, which means that if the last accepted block resides in the first batch, no bits are decoded.
Define $V_0$ to be the following event:

$$V_0 = \bigcap_{j=\rho_k}^{k} \left\{ \left(E_1^{(j)}\right)^c \cap \left(E_2^{(j)}\right)^c \right\}$$



Other block size increments are possible, resulting in different trade-offs between error probability and convergence rate.
For instance, one can use the recursion $b_k = \left(\sum_{j=1}^{k-1} b_j\right)^{\nu}$ for some $\nu \in (0, 1)$.

In general, the scheme may be terminated arbitrarily in the middle of some block, and decoding will be carried out w.r.t.
the end of the previous block. Since the maximal block size is $O(\sqrt{n})$, this can cause a maximal rate fluctuation of $O(n^{-1/2})$,
which should be added to the redundancy term $\varepsilon_2(n)$, but turns out to be asymptotically negligible.
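Using the same ideas as in the fixed-horizon analysis, we can show $V_0$ implies that no update decoding
error occurred in the last batch. Due to the recency threshold, this implies in turn that either the decoded
message interval is correct, or that no bits are decoded. Therefore, $V_0^c$ is a necessary condition for an error,
and so

$$P_e(n, \mathcal{W}, \theta_0) \le \mathbb{P}(V_0^c) = \mathbb{P}\left( \bigcup_{j=\rho_k}^{k} \left( E_1^{(j)} \cup E_2^{(j)} \right) \right) \le 5|\mathcal{X}|^3\, k \left\lceil \log(k^{\alpha_3} + 1) \right\rceil \exp\left( -\frac{k^{\alpha_3(2(\alpha_1 - \alpha_2) - 1)}}{8|\mathcal{X}|^4 \left\lceil \log(k^{\alpha_3} + 1) \right\rceil^2} \right)$$
$$\le 5|\mathcal{X}|^3 \sqrt{2n} \left\lceil \log\left((2n)^{\alpha_3/2} + 1\right) \right\rceil \exp\left( -\frac{(\sqrt{2n} - 1)^{\alpha_3(2(\alpha_1 - \alpha_2) - 1)}}{8|\mathcal{X}|^4 \left\lceil \log\left((2n)^{\alpha_3/2} + 1\right) \right\rceil^2} \right) \triangleq \varepsilon_1(n) \qquad (50)$$

where we have used the union bound over blocks in the last batch, and the fact that the update decoding
error probability of the first block in that batch dominates the others. We get:

$$-\log \varepsilon_1(n) = \Omega\left( \frac{n^{\alpha_3(\alpha_1 - \alpha_2 - \frac{1}{2})}}{\log^2 n} \right) \qquad (51)$$

so the error probability tends to zero uniformly for any selection $\alpha_1 - \alpha_2 > \frac{1}{2}$. This concludes the error
probability part of the proof.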

We now show that at any time point, the decoding rate attained by the scheme is close to the empirical 
capacity with probability approaching one. Let Vi be the event where none of the blocks in the last batch 
were discarded due to an improper selection of Mt, Af„ made by the receiver. Using Hoeffding's inequality 
as in pO]) . it is readily verified that 

$$-\log \mathbb{P}(V_1^c) = \Omega\left(n^{\alpha_3(\alpha_1 - \frac{1}{2})}\right) \qquad (52)$$

and so for any selection ai — 02 > ^ both P(Vo),P(Vi) 1. Now, let p^°^'^ be the empirical distribution 
over accepted regular positions in the first batch, and p'^^'^ the corresponding distribution in the last 
batch. Let us express p^°^ as 

p-g - Ap-S'f + (1 - A)p-s-^ 

Due to possible non-negligible erroneous update decoding, the receiver might use noisy observations for 
its KT estimates. In the finite-horizon case this problem was averted since the update error probability 
in each block was negligible, and so the event of noisy observations had a vanishing impact incorporated 
into the redundancy term e^{n). However, in the horizon-free case there is a non- vanishing update error 
probability dominated by the first blocks. Nevertheless, under Vb only the first batch may include erroneous 
blocks, and thus the hamming distance between the actual noise sequence (over accepted regular positions) 
and the one used by the receiver when updating the KT estimates, is upper bounded by c? = An. Now, since 
hk < \/2n, the receiver uses a KT(2%/2n) estimator and using Lemmas [2] and [3] with noisy observations we 
have 

/ log n 



/n 



R'^s >R^-Kt[ + A log n (given Uq n Ui) (53) 



for some K^ > large enough, where i?^ = /3(log|A'| — H{p^°^y). Let be the maximal possible number 
of non- regular positions in blocks j to k (where k is the last block), i.e, 

k k 

= ^ 4mj (1 + 2 log hj )<KzY. log (54) 
where K2 was defined in (|32p . Simple algebraic manipulations yield the following bound for (3: 



1-AV n J 1-A 

for some K% > 0, where (3g is the fraction of regular positions in the last batch that were accepted. To 
continue, we need the following Lemma. 
Lemma 8. Let $p$ and $q$ be any two probability distributions over a finite alphabet $\mathcal{X}$. Then for any $\lambda \in [0, 1]$

$$H(\lambda q + (1 - \lambda)p) \le H(p) + 3|\mathcal{X}|\,\lambda \log\frac{1}{\lambda}$$

Proof. Let $p = (p_1, \ldots, p_{|\mathcal{X}|})$ be a probability distribution over $\mathcal{X}$ with nonzero elements. Let $v$ be the
representation of $p$ over the $|\mathcal{X}| - 1$ dimensional probability simplex $\mathcal{S}^{|\mathcal{X}|-1}$, i.e., a vector of the first $|\mathcal{X}| - 1$
elements of $p$. With some abuse of notations, denote by $H(v)$ the entropy function of $p$, calculated over
$\mathcal{S}^{|\mathcal{X}|-1}$. We take the partial derivatives of $H$ and get

$$\frac{\partial H(v)}{\partial v_i} = \log\frac{1 - \sum_{j=1}^{|\mathcal{X}|-1} v_j}{v_i}$$

Since $H(\cdot)$ is concave over $\mathcal{S}^{|\mathcal{X}|-1}$, its tangents at any point are always above it. Therefore for any
$r \in \mathbb{R}^{|\mathcal{X}|-1}$ that satisfies $(v + r) \in \mathcal{S}^{|\mathcal{X}|-1}$, we have that

$$H(v + r) \le H(v) + \sum_{i=1}^{|\mathcal{X}|-1} r_i \log\frac{1 - \sum_{j=1}^{|\mathcal{X}|-1} v_j}{v_i} \qquad (56)$$

Now let $v$ and $u$ be vectors over the $|\mathcal{X}| - 1$ dimensional simplex that correspond to $p$ and $q$ respectively,
and $0 < \lambda < 1$ some constant. With the same abuse of notations, we use (56):



H{Xq + (1 - A)p) = H{v + X{u - v)) < H{v) + A X! " 



1=1 



H{p) + A ^ (g. - p,) log ^ < H{p) + A ^ 



log — 

Pi 



Assume for the moment that X < j. If it so happens and all the symbol probabilities satisfy Pi > X, 
then from the above we have 



\x\-i 

HiXq+il-X)p)<H{p) + X J2 



1 Pixi 

log — 

Pt 



<H{p) + {\X\-l)Xlogj 



(57) 



which satisfies the statement in the Lemma. Otherwise, assume that there are precisely t symbols that do 
not satisfy that requirement. Without loss of generality we assume that pi < P2 < ■ ■ ■ < P\x\; and therefore 
Pt < X. Define 

■0 = Aq + (1 - A)p = (-01, . . . , i'lxi) 

The first t elements of if) are all smaller than 2A. Without loss of generality we assume that al least one 
of those t elements is nonzero, as otherwise we can reduce the dimension of the problem. We have the 
following: 



(a) 



H{lp) ^ i?(?/'l+?/'|.:,|,02,-.-,l/'|;f|-l) + (V'l+V'|A-|)^£ 



01 



(b) (c) / * \ 

< i7(0i+0|.:,|,02,...,0|.:tHl)+/iB(2A) < H 2^0. +0|;,i,0, + i,...,0|.:,|_i +t-hs{2X) 



(d) 



i\X\-t-l)Xlog-+t-hsi2X) 
A 



< H{p) + lA-IAlogi + \X\ { 2Alog;^ +2Aloge ) < H{p) + 3\X\X\og 



2A 



In (a) we applied the entropy's grouping property [29], in (b) we used the fact that A < -j, in (c) we 
repeated the two previous steps t~l more times, in (d) we used (|57p since the probability vector argument 
of the entropy function has a minimal symbol probability exceeding A , and in (e) we used < i < | A"!, the 
entropy's grouping property, and the inequality hB{p) < plog ^ +ploge. This proves the result for A < -j. 
The proof is now concluded by noticing that for A > | the excess term satisfies 3 1 A" | A log j > log \ X\. □ 
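The bound of Lemma 8 can be sanity-checked numerically. The sketch below is an illustration only: it assumes logarithms in base 2, restricts the check to $\lambda \in (0, \frac{1}{2}]$, and uses hypothetical helper names. It returns the slack between the right-hand side $H(p) + 3|\mathcal{X}|\lambda\log\frac{1}{\lambda}$ and the mixture entropy, which should be nonnegative.

```python
import math
import random

def entropy(p):
    """Shannon entropy in bits, with 0 * log(0) taken as 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def mixture_gap(p, q, lam):
    """Slack of H(lam*q + (1-lam)*p) <= H(p) + 3|X| lam log(1/lam)."""
    mix = [lam * qi + (1 - lam) * pi for pi, qi in zip(p, q)]
    bound = entropy(p) + 3 * len(p) * lam * math.log2(1 / lam)
    return bound - entropy(mix)

def random_distribution(size, rng):
    """Draw a probability vector by normalizing uniform random weights."""
    w = [rng.random() for _ in range(size)]
    total = sum(w)
    return [x / total for x in w]
```

A typical use is to draw many random pairs with `random_distribution` and confirm that `mixture_gap` never goes negative over the chosen range of `lam`.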
Applying Lemma 8 to i?,? and using inequality (55), we have

R, = /3(log|A-|-H(Ap-^-f + (l-A)p-s,^) 

Pi 
1 - A 

/I — 277."'' — iir877 2 log 77 



> 



1 - 277'^^'-^ - A877— log77 loglA"! -i^(p^'=s^'^)-3|A'|Alog- 



> 



1-A 



^-^-SIA-I^log^ 



(58) 



where 



R^^Pe[log\X\-H{pl^^'') 



i?^ is a quantity similar to R^, but only for the last batch. Define , p^'°^'^ , p'^J^'^ to be the empirical 
distribution of the noise sequence in the entire last batch, over regular positions, and over discarded regular 
positions, respectively. We can now repeat the finite-horizon analysis over the last batch only, using the 
parameters of the pfe-th block which is the smallest in the batch. Namely, one can set the an auxiliary 
parameter 7(77) = o(l) to satisfy (the equivalent of (|4T|) ) 



= n {rup^hpl iog6pJ + n{Tp^) = n (77 "'^ ' log 

and define the events 

V2 = \ 2\X 



< 



1 



to obtain 



and with 



R'>RU>^og\X\~H{p^^^'') 



(given Pi V^] 

i=0 



(59) 



(60) 



-l0gP(y2') = (&;^^777^^_ 7'') = (l0g2 77-r7°^("l-^'-(l-''l)) + 17 (77"^ (-^l ^ 3 ^ 

and so setting 03(01 — |) > max (1 — ai, 02) we have P(V2) 1. 

Let us now define $V_4$ as the event where $\lambda < n^{-\alpha_4}$ for some $\alpha_4 \in (0, 1)$. Using (53), (58) and (60) we get



> loglA"! -ff(p"s/) 

'log 77 



277°3 ^ + _fi'877^~ l0g 77 — 77 



logl-Yl 77-'^Mog(277'^'') 

3 A" 



K7 



1 - 77-°-l 

log77 ) = log \X\ - H{p''^-^) + O (77-"^= log 77) (given fj V,) (61) 



1 - n-'^i 
4 



1=0 



where 05 ^ min(l — 03, ^ , 04, i). To express (|6ip in terms of the rate i?„ and the empirical entropy 
^(Pomp(-^")) Jensen's inequality and standard manipulations, yielding (given Vi) 



Rn>\l--]R 



H{p^) > ( 1 - ^ ) i7(p-^'0 , H{p(Z-)) >[l--] H{p^^^'') (62) 



c 



where the terms was defined in (|54p and ^ is given by 



C = + E = O ("^ log") + 0{n'^') 



Notice that the minimal positive value for A is always greater than ^ (single accepted block in the first batch), and for 
A = (first batch fully discarded) the penalty term y~\ '°S y zero. 
and corresponds to the maximal number of channel uses wasted on the first batch and on non-regular
transmission in the second batch together. Thus, since already involves all the relevant terms, we get 

4 

i?„(W, 9o) = C„°"^P(W, 9o) + 0{n--^^ logn) (given f| V^) (63) 

1=0 

As before, under the £00 bound for the entropy (Lemma[T]) yields log \X\ — H{p^) < \X\ log | A'|7(n) = 
O (^n ^4 ^logn^ + O (^n~~r^ and using (|62p yields in turn 

C„°'"P(>V, Bo) = O (^n^v^log") + O (""^) + O (n^ logn) + 0(n"-^-i) (given V^^) (64) 

This penalty can be incorporated into the redundancy term to remove the dependence on the event V3. 

We would also like to remove the dependence on V4, and to that end consider the event V^OVa. Under 
this event and assuming 03+04 < there can be no accepted block of size ri(7i''-^+°*), as otherwise V^: is 
contradicted. Due to Vq, the empirical distribution (over passive positions) of each of these larger blocks 
is Tj close to the training estimate, which in turn is Tj close to being uniform, where j is the index/length 
of the corresponding block. Using the £00 bound for the entropy, the empirical capacity of each of these 
blocks (over passive positions) is therefore 0(rj) = 0(n~°^(''^+''*'^), where we have used the fact that tj 
of the smallest such block dominates the others. Moreover, the empirical capacity of all the blocks of size 
0{n°'^~^°"^) is 0(1) (which is true of course for any block). By convexity, the empirical capacity over the 
entire transmission period is no larger than the average of the empirical capacities over some segmentation. 
Hence, 

C„""P(>V,6lo) = O (^n-^^laa+ai)^ (^^^-2(as+a,)\^ ^^^1^ log + O(n'^^-i) (givcu V^^HVo) (65) 

where the first term is the contribution of blocks of size il(n°=* +"*'), the second term is the contribution 
of the smaller blocks (after averaging w.r.t. the fraction of time they occupy), and the last two terms 
correspond to the deviation possibly incurred by considering only passive positions. 
Let us now combine all the above results. The following inclusion is easily verified: 

4 2 

{r\^}^{vi}^MnVo} ^ f]v, (66) 

i=0 1=0 

Under the event on the left above (and thus also under the event on the right) at least one of (|63p . or 
((65)) must hold. We therefore conclude that 

p(i?„(W,0o) > C„""P(W,0o)-e2(n)) > P(n^*) ^ l-£3(") (67) 

where all the redundancy terms are now incorporated into £2 (n) using all the constraints set thus far (with 
some further relaxations): 

/ \ / —n \ A • / — ^1 ^2 / \ f 

£2(?^)=0(n Mogn) , 05 = mm I — - — , — , 1 - 03, 02(03 + 04), 04, - 
and where e3(n) < J2'i=o^i^f)^ hence 

- log £3(^1) = n (^,^«3(ai-i)-max(l-ai,a.)^ 

To conclude, we summarize the constraints on the constants o,; € (0, 1) which guarantee that Si{n) —^ 0, 
so that the empirical capacity is achieved: 

max (1 - 01,02) < 03 ^oi - , 03 + 04 < i 
There are many parameter selections that satisfy the conditions above, e.g. (oi, 02, 03, 04) = (|, |, |, j^^). 
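As a small worked check, the first of the constraints above, $\max(1 - \alpha_1, \alpha_2) < \alpha_3(\alpha_1 - \frac{1}{2})$, can be tested for a candidate set of exponents. The sketch below covers only this first constraint, the function name is hypothetical, and the sample values in the usage note are illustrative rather than taken from the text.

```python
def satisfies_first_constraint(a1, a2, a3):
    """Check max(1 - a1, a2) < a3 * (a1 - 1/2) for exponents in (0, 1)."""
    in_range = all(0 < a < 1 for a in (a1, a2, a3))
    return in_range and max(1 - a1, a2) < a3 * (a1 - 0.5)
```

For instance, $(\alpha_1, \alpha_2, \alpha_3) = (0.9, 0.1, 0.5)$ passes, since $\max(0.1, 0.1) = 0.1 < 0.5 \cdot 0.4 = 0.2$. Because $\alpha_3 < 1$, any passing triple also satisfies $\alpha_1 - \alpha_2 > \frac{1}{2}$, the condition used in the error probability analysis.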
C A Universal Scheme for $\mathcal{C}_{\mathcal{X}}$ Utilizing Common Randomness

In this section we show how the horizon-free universal scheme developed in the previous section can be
adapted, using common randomness, to achieve the empirical capacity over the larger family of all causal
channels. To that end, we first describe a more general communication setting using common randomness,
and later show how our scheme is adapted into this setting. Note that only passive feedback of the received
sequence is assumed, due to the availability of common randomness. A feedback transmission scheme with
common randomness is a triplet $(G, \mathcal{P}, \Delta)$ and can operate either with or without dithering (the definitions
and details appear below). Using the scheme over a channel $\mathcal{W} \in \mathcal{C}_{\mathcal{X}}$ with a message point $\theta_0 \in [0, 1)$ is
described by the following construction:

• Common randomness resources are assumed to be available in the following form: 

— An i.i.d. control sequence A°° taking values over some countable alphabet with a given 

sequence of marg inal distributions P = {-Pfc(-) = 1"^^ (Ol^r 
^ An i.i.d. dithering sequence taking values over X. When the scheme operates with dithering 

then $fc ~ Uniform(A'), and when it operates without dithering we set $fe = for any fc G N. 

• {X°°,Y°°) constitute an input/output pair for the channel W G'rfx- 

• {X°^\Y°°) are defined by 

Xk^Xk + <i>k n^Yk + <Pk, ken (68) 

• For any message point 6*0 £ [0, 1) and any fc e N 

Xk^ gk{9o,Y^-\A'') (69) 
where G ^ {gt. : [0, 1) x X''-^ x A'' ^ A"}^ is a sequence of transmission functions. 

• Ak is statistically independent of {X^^^ ,Y''^'^ ,<^^~^ , A^~^) for any k e N. 

• <^k is statistically independent of {X'^, y''"^ A'') for any fc e N. 

• The following Markov relation holds for any fc G N: 

Yu ^ x'^y''-^ ^ A''^'' (70) 

Loosely speaking, this relation guarantees privacy of randomness resources, namely that the ad- 
versary/channel cannot utilize common randomness shared by the terminals. This is the common 
randomness counterpart of 

• A = {A/j : X'^ X A''^^ I— > is a sequence of decoding rules, such that Ak{Y'^ , A'^~^) is the 
decoded interval at time k. 
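The role of the dithering relation (68) can be illustrated with a short sketch: adding the same dither symbol to the input and the output, modulo $|\mathcal{X}|$, leaves the realized noise sequence unchanged. The function names are hypothetical, and symbols are represented as integers in $\{0, \ldots, |\mathcal{X}| - 1\}$ for the purpose of this illustration.

```python
def dither(x, y, phi, alphabet_size):
    """Add the same dither symbol to input and output, modulo |X|."""
    x_d = [(xk + pk) % alphabet_size for xk, pk in zip(x, phi)]
    y_d = [(yk + pk) % alphabet_size for yk, pk in zip(y, phi)]
    return x_d, y_d

def realized_noise(x, y, alphabet_size):
    """Realized noise sequence z_k = y_k - x_k (mod |X|)."""
    return [(yk - xk) % alphabet_size for xk, yk in zip(x, y)]
```

Since the dither cancels in the difference, `realized_noise` returns the same sequence for the original pair and for the dithered pair, which is why the empirical capacity is the same under both descriptions.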

For any given channel W S "t^x, feedback transmission scheme (G, T', A) with/without dithering and 
message point 9q £ [0,1), the above construction uniquely determines the joint statistics of (X°°,F°°, 
X°°, ^4°°). The error probability Pe{n, W, 6o) and instantaneous rate i?„(W, Oq) are defined simi- 

larly to Uni), with A„(r", y4"-i) replacing A„(Y", J/""!). The empirical capacity G^^p(>V, 6*0) is defined 
as in ifTTj) . using the realized noise sequence Z°° pertaining to the input/output pair {X°°, Y°°). A scheme 
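is said to locally achieve the empirical capacity for a specific channel/message point pair $(\mathcal{W}, \theta_0)$, if

$$P_e(n, \mathcal{W}, \theta_0) \le \varepsilon_1(n), \qquad \mathbb{P}\left(R_n(\mathcal{W}, \theta_0) \ge C_n^{\mathrm{emp}}(\mathcal{W}, \theta_0) - \varepsilon_2(n)\right) \ge 1 - \varepsilon_3(n)$$

for some $\varepsilon_1(n), \varepsilon_2(n), \varepsilon_3(n) \to 0$. As before, the scheme is said to (uniformly) achieve the empirical capacity
over a family of channels if the above is satisfied uniformly over all channels $\mathcal{W}$ in the family and all $\theta_0 \in [0, 1)$.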

Assuming the scheme operates with dithering, let e 'i^x denote the causal channel induced by the 
input/output pair {X°^ ,Y°°), i.e., the channel defined by 

W\G,r,A,W,eo) ^ {wl{yk\x\y''-')=FY,\x>'Y>'-4yk\x\y''-')}^^^ (71) 
The induced channel depends in general on the transmission scheme and the message point, and is
therefore not a "true channel" in the regular operational sense. Moreover, generally $\mathcal{W}^\dagger$ is not modulo-additive despite
the modulo-additive dithering, due to the statistical coupling generated by feedback. Nevertheless, the
following observation provides an operational meaning to the induced channel.

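Lemma 9. Fix a channel $\mathcal{W} \in \mathcal{C}_{\mathcal{X}}$ and a message point $\theta_0 \in [0,1)$. Let $\mathcal{W}^\dagger = \mathcal{W}^\dagger(G, \mathcal{P}, \Delta, \mathcal{W}, \theta_0)$ be the
corresponding induced channel. The following two statements are equivalent:

(i) The scheme $(G, \mathcal{P}, \Delta)$ operating with dithering locally achieves the empirical capacity for $(\mathcal{W}, \theta_0)$,
with the convergence parameters $\varepsilon_1(n), \varepsilon_2(n), \varepsilon_3(n)$.

(ii) The scheme $(G, \mathcal{P}, \Delta)$ operating without dithering locally achieves the empirical capacity for $(\mathcal{W}^\dagger, \theta_0)$,
with the convergence parameters $\varepsilon_1(n), \varepsilon_2(n), \varepsilon_3(n)$.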

Proof. For any feedback transmission scheme operating with or without dithering, the decoded interval 
A„ is a function of {Y^, A"~^), and the rate and error probability are in turn functions of A„. Moreover, 
the realized noise sequence corresponding to {X°°,Y°°) and to {X°°,Y'^) is exactly the same sequence 
due to ([68|l . hence the empirical capacity C^™p(>V, 6'o) is a function of Therefore in general, 

(A„,pe,-Rn,C'™'^) are functions of (X", y", ^""^). Now for case ^ above, the induced channel W'' 
together with 9q and (G, V, A) uniquely defines the joint distribution of {X°°, Y°°, A°°). But by definition, 
this distribution must coincide with the joint distribution of {X°° ,Y°° , A°°) obtained in case (P, concluding 
the proof. □ 

It should be emphasized that the two statements in the Lemma above correspond to two separate 
constructions. The following important observation is due. 

Lemma 10. Let $\mathcal{W} \in \mathcal{C}_{\mathcal{X}}$, and suppose the scheme $(G, \mathcal{P}, \Delta)$ operates without dithering over the channel
$\mathcal{W}^\dagger(G, \mathcal{P}, \Delta, \mathcal{W}, \theta_0)$ with the message point $\theta_0$. Then for any $a \in \mathcal{A}$, the indicator sequence $\{\mathbb{1}_a(A_k)\}_{k=1}^{\infty}$
is a (not necessarily identically distributed) causal sampling sequence for the noise sequence $Z^\infty$.

Proof. This is an analogue to the statement made in Lemma 6, and the proof is of the same spirit.
We will prove for the case where $(G, \mathcal{P}, \Delta)$ operates with dithering over $\mathcal{W}$ with $\theta_0$, and the result will
follow as in Lemma 9, since the distribution of $(Z^\infty, A^\infty)$ under both settings is the same. Clearly,
$\mathbb{1}_a(A_k) \sim \mathrm{Ber}(P_k(a))$ and the indicator sequence is not necessarily identically distributed. However as
we now show, $A_k$ is statistically independent of $(Z^k, A^{k-1})$ for any $k \in \mathbb{N}$, from which the result follows
immediately. Since $A^\infty$ is a sequence of independent r.v's it is sufficient to show that $Z^k \leftrightarrow A^{k-1} \leftrightarrow A_k$.
To this end, we repeat the derivation in (48) to the letter, replacing transition justifications (a) and (c)
with the following (a*) and (c*) respectively:

(a*) Zk ^ x^-^Y^-^A^-^ ^ Ak. 

Proof: Again we omit the r.v. subscripts where there is no confusion, vector additions over X'^ are 
taken to be element by element modulo-addition. 

^ ^ [P(0fc)P(^'^-i|x^-\/-\a^-) 

(3/0(6*0, a'') + h + Zk\gk{Oo: y^~\a^) + a;'=-i + /"^ + <^'^'"\ a^ <; 
[P(0'=-i|a;'=-\y'=-\a'^-i) 



!,fc-l^ 



'.ex 



Note however that in the special case where $\mathcal{W}$ is memoryless, the induced channel is independent of the transmission
scheme and the message point, is memoryless and modulo-additive, and is obtained by averaging "cyclically shifted" versions
of $\mathcal{W}$ (see the discussion in the end of section III for the binary case).
Ia3l



fc-l 



fc-l 



,fc-i 



a 




where transitions are justified as follows: 

(al) is statistically independent of (X''"^ ^''"^ together with §^ and ((69|) . 

(a2) $fe is uniformly distributed, is statistically independent of {X'''^ ,Y''^^ ,^'^^-^), the Markov 
relation (fTD]) and the definition of the channel W. 



Our horizon-free finite-alphabet universal scheme is now easily adapted to use common randomness
within the framework of this section, as follows. First, active positions are removed (i.e., $b_a = 0$). Instead,
the type of each position and the repetition position information for the update bits are directly provided
by the control sequence $A^\infty$. This is achieved (say) by using an alphabet $\mathcal{A} = \{\text{training}, \text{regular}\} \cup \mathbb{N} \cup \{0\}$,
where numerical values correspond to update positions and determine which update bit is to be transmitted
using which input pair, taking the place of the F*^" described in subsection IV.5. Thus, $X_k$ is now generated
from $(\theta_0, Y^{k-1}, A^k)$ instead of from $(\theta_0, U^{k-1})$. Finally, the sequence of marginal distributions $\mathcal{P}$ is suitably
defined taking into account the removal of active positions (which can only improve the redundancy term).
Namely, for any position $j$ within the $k$th block we have $P_j(\text{training}) = m_k b_k^{-1}$, $P_j(\text{regular}) = 1 - 2 m_k b_k^{-1}$,
and a uniform distribution over the numerical values (sk) which constitute the rest of the support of $P_j(\cdot)$
(where $s_k = 2|\mathcal{X}| \lceil \log(k + 1) \rceil$ corresponds to the number of update bits, see (49)). Given the modifications
described above, the adapted universal transmission scheme under the new construction, either operating
with or without dithering, is well defined.
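The marginal distribution of the control sequence described above can be sketched as follows. This is a minimal illustration under stated assumptions: $b_k = k$ and $m_k = k^{\alpha_1}$ as in the previous section, logarithms in base 2, and the update positions labeled by the integers $1, \ldots, s_k$ (a hypothetical labeling chosen for this example). The function name is also hypothetical.

```python
import math

def control_distribution(k, alpha1, alphabet_size):
    """Marginal P_j for a position inside block k, assuming b_k = k,
    m_k = k**alpha1 and s_k = 2|X| * ceil(log2(k + 1)) update values."""
    b_k = k
    m_k = k ** alpha1
    if 2 * m_k > b_k:
        raise ValueError("block too small: 1 - 2*m_k/b_k would be negative")
    s_k = 2 * alphabet_size * math.ceil(math.log2(k + 1))
    dist = {"training": m_k / b_k, "regular": 1 - 2 * m_k / b_k}
    for value in range(1, s_k + 1):
        # The remaining mass m_k / b_k is split uniformly over update values.
        dist[value] = (m_k / b_k) / s_k
    return dist
```

For example, with `k = 100`, `alpha1 = 0.5` and a binary alphabet, the training mass is 0.1, the regular mass is 0.8, and the remaining 0.1 is spread uniformly over 28 update values.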

We are now ready to show that the adapted scheme with dithering achieves the empirical capacity
over $\mathcal{C}_{\mathcal{X}}$. Clearly, the adapted scheme without dithering is essentially equivalent to the scheme without
common randomness discussed in previous sections (up to the minor issue of active feedback replaced by
common randomness), and by repeating the same proof it is readily verified that it achieves the empirical
capacity over the modulo-additive family considered there as well. Moreover, note that the fact that $\mathcal{W}$
belongs to that family was used in that proof solely for
the sake of Lemma 6, namely to show that the training pattern sequence and each of the update pattern
sequences constitute causal sampling sequences for the noise sequence within each block. Now, for a
given channel $\mathcal{W} \in \mathcal{C}_{\mathcal{X}}$ and a message point $\theta_0 \in [0, 1)$, suppose the adapted scheme operates without
dithering over the corresponding induced channel $\mathcal{W}^\dagger(G, \mathcal{P}, \Delta, \mathcal{W}, \theta_0)$ with the same message point $\theta_0$. In
this case, Lemma 10 verifies that under the construction considered in this section, it still holds that the
training and update pattern sequences constitute causal sampling sequences for the noise sequence within
each block. Therefore, we conclude that for any $\mathcal{W} \in \mathcal{C}_{\mathcal{X}}$ and $\theta_0 \in [0, 1)$, the adapted scheme operating
without dithering locally achieves the empirical capacity for $(\mathcal{W}^\dagger, \theta_0)$. Furthermore, note that although the
induced channel depends both on the message point and on the channel $\mathcal{W}$, the convergence parameters
$\varepsilon_1(n), \varepsilon_2(n), \varepsilon_3(n)$ do not. Finally, according to Lemma 9 the above implies that the adapted scheme with
dithering locally achieves the empirical capacity for any pair of channel $\mathcal{W} \in \mathcal{C}_{\mathcal{X}}$ and message point
$\theta_0 \in [0,1)$, with convergence parameters $\varepsilon_1(n), \varepsilon_2(n), \varepsilon_3(n)$ independent of $(\mathcal{W}, \theta_0)$. Hence, by definition
this scheme uniformly achieves the empirical capacity over the family $\mathcal{C}_{\mathcal{X}}$, and the proof is concluded.

As discussed in section III, when operating over $\mathcal{C}_{\mathcal{X}}$ a uniform input distribution is essential in order for
the defined modulo-additive empirical capacity to be meaningful, and in turn achievable. The discussion in
this section reveals the operational significance of this requirement within the framework of our universal
scheme. The entire scheme hinges on the ability of a training sample to estimate the empirical capacity
of a block, and on the capability to reliably transmit update information over any block whose empirical
capacity is not too small. Roughly speaking, a uniform input distribution (obtained here by dithering)
guarantees that with high probability, the empirical distribution of the realized noise sequence over an i.i.d.
sample (i.e., training or update positions) is close to that of the entire realized noise sequence.